http://arxiv.org/abs/2603.05459v1
DEBISS: a Corpus of Individual, Semi-structured and Spoken Debates
Updated: 2026-03-05T18:30:10Z
The process of debating is essential in our daily lives, whether in studying, work activities, simple everyday discussions, political debates on TV, or online discussions on social networks. The range of uses for debates is broad. Due to the diverse applications, structures, and formats of debates, developing corpora that account for these variations can be challenging, and the scarcity of debate corpora in the state of the art is notable. For this reason, the current research proposes the DEBISS corpus: a collection of spoken and individual debates with semi-structured features. The corpus carries annotations for a broad range of NLP tasks, such as speech-to-text, speaker diarization, argument mining, and debater quality assessment.
Published: 2026-03-05T18:30:10Z
Authors: Klaywert Danillo Ferreira de Souza, David Eduardo Pereira, Cláudio E. C. Campelo, Larissa Lucena Vasconcelos

http://arxiv.org/abs/2603.05439v1
O^3-LSM: Maximizing Disaggregated LSM Write Performance via Three-Layer Offloading
Updated: 2026-03-05T18:00:51Z
Log-Structured Merge-tree-based Key-Value Stores (LSM-KVS) have been optimized and redesigned for disaggregated storage via techniques such as compaction offloading to reduce the network I/Os between compute and storage. However, the constrained memory space and slow flush at the compute node severely limit the overall write throughput of existing optimizations. In this paper, we propose O3-LSM, a fundamentally new LSM-KVS architecture that leverages the shared Disaggregated Memory (DM) to support three-layer offloading, i.e., memtable Offloading, flush Offloading, and the existing compaction Offloading. Compared to the existing disaggregated LSM-KVS with compaction offloading only, O3-LSM maximizes the write performance by addressing the above issues.
O3-LSM first leverages a novel DM-Optimized Memtable to achieve dynamic memtable offloading, which extends the write buffer while enabling fast, asynchronous, and parallel memtable transmission. Second, we propose Collaborative Flush Offloading that decouples the flush control plane from execution and supports memtable flush offloading at any node with dedicated scheduling and global optimizations. Third, O3-LSM is further improved with the Shard-Level Optimization, which partitions the memtable into shards based on disjoint key-ranges that can be transferred and flushed independently, unlocking parallelism across shards. Besides, to mitigate slow lookups in the disaggregated setting, O3-LSM also employs an adaptive Cache-Enhanced Read Delegation mechanism to combine a compact local cache with DM-assisted memtable delegated read. Our evaluation shows that O3-LSM achieves up to 4.5X write, 5.2X range query, and 1.8X point lookup throughput improvement, and up to 76% P99 latency reduction compared with Disaggregated-RocksDB, CaaS-LSM, and Nova-LSM.
Published: 2026-03-05T18:00:51Z
Comment: Accepted to SIGMOD 2026 as a full research paper
Authors: Qi Lin, Gangqi Huang, Te Guo, Chang Guo, Viraj Thakkar, Zichen Zhu, Jianguo Wang, Zhichao Cao

http://arxiv.org/abs/2601.13117v2
The Case for Cardinality Lower Bounds
Updated: 2026-03-05T17:32:26Z
Despite decades of research, cardinality estimation remains the optimizer's Achilles heel, with industrial-strength systems exhibiting a systemic tendency toward underestimation. At cloud scale, this is a severe production vulnerability: in Microsoft's Fabric Data Warehouse (DW), a mere 0.05% of extreme underestimates account for 95% of all CPU under-allocation, causing preventable slowdowns for thousands of queries daily. Yet recent theoretical work on provable upper bounds only corrects overestimation, leaving the more harmful problem of underestimation unaddressed. We argue that closing this gap is an urgent priority for the database community.
As a vital step toward this goal, we introduce xBound, the first theoretical framework for computing provable join size lower bounds. By clipping the optimizer's estimates from below, xBound offers strict mathematical safety nets demanded by production systems - using only a handful of lightweight base table statistics. We demonstrate xBound's practical impact on Fabric DW: on the StackOverflow-CEB benchmark, it corrects 23.6% of Fabric DW's underestimates, yielding end-to-end query speedups of up to 20.1x, demonstrating that even a first step toward provable lower bounds can deliver meaningful production gains and motivating the community to further pursue this critical, open direction.
Published: 2026-01-19T15:01:26Z
Comment: v2: added probabilistic lower bounds + e2e evaluation on Fabric DW
Authors: Mihail Stoian, Tiemo Bang, Hangdong Zhao, Jesús Camacho-Rodríguez, Yuanyuan Tian, Andreas Kipf

http://arxiv.org/abs/2603.05405v1
Bala-Join: An Adaptive Hash Join for Balancing Communication and Computation in Geo-Distributed SQL Databases
Updated: 2026-03-05T17:31:15Z
Shared-nothing geo-distributed SQL databases, such as CockroachDB, are increasingly vital for enterprise applications requiring data resilience and locality. However, we encountered significant performance degradation at the customer side, especially when their deployments span multiple data centers over a Wide Area Network (WAN). Our investigation identifies the bottleneck in the performance of the Distributed Hash Join (Dist-HJ) algorithm, which is contingent upon a crucial balance between communication overhead and computational load. This balance is severely disrupted when processing skewed data from real-world customer workloads, leading to the observed performance decline. To tackle this challenge, we introduce Bala-Join, an adaptive solution to balance the computation and network load in Dist-HJ execution. Our approach consists of the Balanced Partition and Partial Replication (BPPR) algorithm and a distributed online skewed join key detector.
The former achieves balanced redistribution of skewed data through a multicast mechanism to improve computational performance and reduce network overhead. The latter provides real-time skewed join key information tailored to BPPR. Furthermore, an Active-Signaling and Asynchronous-Pulling (ASAP) mechanism is incorporated to enable efficient, real-time synchronization between the detector and the redistribution process with minimal overhead. Our empirical study shows that Bala-Join outperforms the popular Dist-HJ solutions, increasing throughput by 25%-61%.
Published: 2026-03-05T17:31:15Z
Comment: 14 pages, 8 figures
Authors: Wenlong Song, Hui Li, Bingying Zhai, Jinxin Yang, Pinghui Wang, Luming Sun, Ming Li, Jiangtao Cui

http://arxiv.org/abs/2603.06703v1
The Fifth Graph Normal Form (5GNF): A Trait-Based Framework for Metadata Normalization in Property Graphs
Updated: 2026-03-05T14:32:55Z
Graph databases are widely used in systems that manage rich metadata, yet current modelling practices often embed descriptive attributes directly in nodes, leading to redundancy and inconsistent semantics. This paper introduces the Fifth Graph Normal Form (5GNF), a trait-based normalization framework for property graphs that represents recurring metadata as canonical Trait Nodes connected through HAS_TRAIT relationships. We formalize trait functional dependencies (tFDs) and present the TraitExtraction5GNF algorithm for identifying and extracting reusable traits. The approach is implemented in Neo4j and evaluated using the widely used Northwind dataset, which contains substantial duplication in location and shipping metadata. The normalization process externalizes recurring metadata into shared traits, removes thousands of redundant attribute instances, reduces schema complexity, and simplifies analytical queries. Experimental results indicate that the normalized model maintains competitive performance while improving semantic clarity and reusability of metadata structures.
These findings suggest that 5GNF provides a practical normalization framework for property graph schemas and contributes toward more consistent and maintainable graph data models.
Published: 2026-03-05T14:32:55Z
Comment: Accepted at ENASE 2026. 14 pages, 6 figures. Implementation and experimental scripts available at https://github.com/yahyazuh/5GNF-normalization-example
Authors: Yahya Sa'd, Vojtech Merunka, Renzo Angles

http://arxiv.org/abs/2603.05180v1
CRISP: Correlation-Resilient Indexing via Subspace Partitioning
Updated: 2026-03-05T13:47:43Z
As the dimensionality of modern learned representations increases to thousands of dimensions, the state-of-the-art Approximate Nearest Neighbor (ANN) indices exhibit severe limitations. Graph-based methods (e.g., HNSW) suffer from prohibitive memory consumption and routing degradation, while recent randomized quantization and learned rotation approaches (e.g., RaBitQ, OPQ) impose significant preprocessing overheads. We introduce CRISP, a novel framework designed for ANN search in very-high-dimensional spaces. Unlike rigid pipelines that apply expensive orthogonal rotations indiscriminately, CRISP employs a lightweight, correlation-aware adaptive strategy that redistributes variance only when necessary, effectively reducing the preprocessing complexity. We couple this adaptive mechanism with a cache-coherent Compressed Sparse Row (CSR) index structure. Furthermore, CRISP incorporates a multi-stage dual-mode query engine: a Guaranteed Mode that preserves rigorous theoretical lower bounds on recall, and an Optimized Mode that leverages rank-based weighted scoring and early termination to reduce query latency.
Extensive evaluation on datasets of very high dimensionality (up to 4096) demonstrates that CRISP achieves state-of-the-art query throughput, low construction costs, and peak memory efficiency.
Published: 2026-03-05T13:47:43Z
Authors: Dimitris Dimitropoulos, Achilleas Michalopoulos, Dimitrios Tsitsigkos, Nikos Mamoulis

http://arxiv.org/abs/2603.05162v1
RESYSTANCE: Unleashing Hidden Performance of Compaction in LSM-trees via eBPF
Updated: 2026-03-05T13:32:41Z
The development of high-speed storage devices such as NVMe SSDs has shifted the primary I/O bottleneck from hardware to software. Modern database systems also rely on kernel-based I/O paths, where frequent system call invocations and kernel-user space transitions lead to relatively large overheads and performance degradation. This issue is particularly pronounced in Log-Structured Merge-tree (LSM-tree)-based NoSQL databases. We identified that, in particular, the background compaction process generates a large number of read system calls, causing significant overhead. To address this problem, we propose RESYSTANCE, which leverages eBPF and io_uring to free compaction from system calls and unlock hidden performance potential. RESYSTANCE improves disk I/O efficiency during read operations via io_uring and significantly reduces software stack overhead by handling compaction directly inside the kernel through eBPF. Moreover, RESYSTANCE minimizes user-kernel transitions by offloading key I/O routines into the kernel without modifying the LSM-tree structure or compaction algorithm. RESYSTANCE was extensively evaluated using db_bench, YCSB, and OLTP workloads. Compared to baseline RocksDB, it reduced the average number of system call invocations during compaction by 99% and shortened compaction time by 50%.
Consequently, in write-intensive workloads, RESYSTANCE improved throughput by up to 75% and reduced the p99 latency by 40%.
Published: 2026-03-05T13:32:41Z
Comment: To appear in IEEE International Conference on Data Engineering (ICDE) 2026
Authors: Hongsu Byun, Seungjae Lee, Honghyeon Yoo, Myoungjoon Kim, Sungyong Park

http://arxiv.org/abs/2603.03065v2
V3DB: Audit-on-Demand Zero-Knowledge Proofs for Verifiable Vector Search over Committed Snapshots
Updated: 2026-03-05T12:01:52Z
Dense retrieval services increasingly underpin semantic search, recommendation, and retrieval-augmented generation, yet clients typically receive only a top-$k$ list with no auditable evidence of how it was produced. We present V3DB, a verifiable, versioned vector-search service that enables audit-on-demand correctness checks for approximate nearest-neighbour (ANN) retrieval executed by a potentially untrusted service provider. V3DB commits to each corpus snapshot and standardises an IVF-PQ search pipeline into a fixed-shape, five-step query semantics. Given a public snapshot commitment and a query embedding, the service returns the top-$k$ payloads and, when challenged, produces a succinct zero-knowledge proof that the output is exactly the result of executing the published semantics on the committed snapshot -- without revealing the embedding corpus or private index contents. To make proving practical, V3DB avoids costly in-circuit sorting and random access by combining multiset equality/inclusion checks with lightweight boundary conditions. Our prototype implementation based on Plonky2 achieves up to $22\times$ faster proving and up to $40\%$ lower peak memory consumption than the circuit-only baseline, with millisecond-level verification time.
GitHub repository: https://github.com/TabibitoQZP/zk-IVF-PQ.
Published: 2026-03-03T15:04:09Z
Authors: Zipeng Qiu, Wenjie Qu, Jiaheng Zhang, Binhang Yuan

http://arxiv.org/abs/2603.04937v1
FluxSieve: Unifying Streaming and Analytical Data Planes for Scalable Cloud Observability
Updated: 2026-03-05T08:36:59Z
Despite many advances in query optimization, indexing techniques, and data storage, modern data platforms still face difficulties in delivering robust query performance under high concurrency and computationally intensive queries. This challenge is particularly pronounced in large-scale observability platforms handling high-volume, high-velocity data records. For instance, recurrent, expensive filtering queries at query time impose substantial computational and storage overheads in the analytical data plane. In this paper, we propose FluxSieve, a unified architecture that reconciles traditional pull-based query processing with push-based stream processing by embedding a lightweight in-stream precomputation and filtering layer directly into the data ingestion path. This avoids the complexity and operational burden of running queries in dedicated stream processing frameworks. Concretely, this work (i) introduces a foundational architecture that unifies streaming and analytical data planes via in-stream filtering and record enrichment, (ii) designs a scalable multi-pattern matching mechanism that supports concurrent evaluation and on-the-fly updates of filtering rules with minimal per-record overhead, (iii) demonstrates how to integrate this ingestion-time processing with two open-source analytical systems -- Apache Pinot as a Real-Time Online Analytical Processing (RTOLAP) engine and DuckDB as an embedded analytical database, and (iv) performs a comprehensive experimental evaluation of our approach.
Our evaluation across different systems, query types, and performance metrics shows up to orders-of-magnitude improvements in query performance at the cost of negligible additional storage and very low computational overhead.
Published: 2026-03-05T08:36:59Z
Authors: Adriano Vogel, Sören Henning, Otmar Ertl

http://arxiv.org/abs/2603.03589v2
stratum: A System Infrastructure for Massive Agent-Centric ML Workloads
Updated: 2026-03-05T07:47:35Z
Recent advances in large language models (LLMs) transform how machine learning (ML) pipelines are developed and evaluated. LLMs enable a new type of workload, agentic pipeline search, in which autonomous or semi-autonomous agents generate, validate, and optimize complete ML pipelines. These agents predominantly operate over popular Python ML libraries and exhibit highly exploratory behavior. This results in thousands of executions for data profiling, pipeline generation, and iterative refinement of pipeline stages. However, the existing Python-based ML ecosystem is built around libraries such as Pandas and scikit-learn, which are designed for human-centric, interactive, sequential workflows and remain constrained by Python's interpretive execution model, library-level isolation, and limited runtime support for executing large numbers of pipelines. Meanwhile, many high-performance ML systems proposed by the systems community either target narrow workload classes or require specialized programming models, which limits their integration with the Python ML ecosystem and makes them largely ill-suited for LLM-based agents. This growing mismatch exposes a fundamental systems challenge in supporting agentic pipeline search at scale. We therefore propose stratum, a unified system infrastructure that decouples pipeline execution from planning and reasoning during agentic pipeline search.
Stratum integrates seamlessly with existing Python libraries, compiles batches of pipelines into optimized execution graphs, and efficiently executes them across heterogeneous backends, including a novel Rust-based runtime. We present stratum's architectural vision along with an early prototype, discuss key design decisions, and outline open challenges and research directions. Finally, preliminary experiments show that stratum can significantly speed up large-scale agentic pipeline search, by up to 16.6x.
Published: 2026-03-03T23:43:12Z
Authors: Arnab Phani, Elias Strauss, Sebastian Schelter

http://arxiv.org/abs/2603.04905v1
Deterministic Preprocessing and Interpretable Fuzzy Banding for Cost-per-Student Reporting from Extracted Records
Updated: 2026-03-05T07:47:02Z
Administrative extracts are often exchanged as spreadsheets and may be read as reports in their own right during budgeting, workload review, and governance discussions. When an exported workbook becomes the reference snapshot for such decisions, the transformation can be checked by recomputation against a clearly identified input.
A deterministic, rule-governed, file-based workflow is implemented in cad_processor.py. The script ingests a Casual Academic Database (CAD) export workbook and aggregates inclusive on-costs and student counts into subject-year and school-year totals, from which it derives cost-per-student ratios. It writes a processed workbook with four sheets: Processing Summary (run record and counters), Trend Analysis (school-year cost-per-student matrix), Report (wide subject-level table), and Fuzzy Bands (per-year anchors, membership weights, and band labels). The run record includes a SHA-256 hash of the input workbook bytes to support snapshot-matched recomputation.
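The two mechanical steps in this paragraph, hashing the input bytes and deriving cost-per-student ratios from aggregated totals, can be sketched as follows. This is an illustrative sketch only, not the code in cad_processor.py: the row layout `(school, year, on_cost, students)` and both function names are assumptions, and groups with no students are mapped to NaN as one possible convention.

```python
import hashlib
from collections import defaultdict


def sha256_of_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest of the raw input bytes."""
    return hashlib.sha256(data).hexdigest()


def cost_per_student(rows):
    """Aggregate (school, year, on_cost, students) rows into school-year ratios.

    The row layout is hypothetical; the abstract does not give the columns.
    """
    totals = defaultdict(lambda: [0.0, 0])  # key -> [total cost, total students]
    for school, year, on_cost, students in rows:
        totals[(school, year)][0] += on_cost
        totals[(school, year)][1] += students
    # A group without students has no defined ratio; NaN marks it as such.
    return {
        key: (cost / count if count > 0 else float("nan"))
        for key, (cost, count) in totals.items()
    }


rows = [
    ("A", 2023, 100.0, 10),
    ("A", 2023, 50.0, 5),
    ("B", 2023, 30.0, 0),
]
ratios = cost_per_student(rows)
digest = sha256_of_bytes(b"")  # in practice: the bytes of the exported workbook
```

Hashing the bytes rather than the parsed contents means any change to the file, including formatting-only changes, yields a different digest, which fits the stated goal of matching a recomputation to one specific snapshot.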
For within-year interpretation, the workflow adds a simple fuzzy banding layer that labels finite, positive school-year cost-per-student values as Low, Medium, or High. The per-year anchors are the minimum, median, and maximum of the finite, positive ratios. Membership weights are computed using left-shoulder, triangular, and right-shoulder functions, with deterministic tie-breaking in a fixed priority order (Medium, then Low, then High). These weights are treated as decision-support signals rather than probabilities.
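A minimal sketch of such a banding layer is given below. The abstract names the anchors, the three function shapes, and the tie-breaking order, but not the exact breakpoints, so the parameterisation here is an assumption: Low falls from 1 at the minimum to 0 at the median, Medium peaks at the median, and High rises from 0 at the median to 1 at the maximum. All function names are hypothetical.

```python
import math
from statistics import median

# Tie-break order stated in the abstract: Medium, then Low, then High.
PRIORITY = ("Medium", "Low", "High")


def anchors(values):
    """Minimum, median, and maximum of the finite, positive values."""
    xs = [v for v in values if math.isfinite(v) and v > 0]
    if not xs:
        raise ValueError("no finite, positive values")
    return min(xs), median(xs), max(xs)


def _ramp_down(x, a, b):
    """1 at or below a, falling linearly to 0 at or above b."""
    if x <= a:
        return 1.0
    if x >= b:
        return 0.0
    return (b - x) / (b - a)


def _ramp_up(x, a, b):
    """0 at or below a, rising linearly to 1 at or above b."""
    if x >= b:
        return 1.0
    if x <= a:
        return 0.0
    return (x - a) / (b - a)


def memberships(x, lo, mid, hi):
    """Membership weights under the assumed breakpoints (lo, mid, hi)."""
    return {
        "Low": _ramp_down(x, lo, mid),                               # left shoulder
        "Medium": min(_ramp_up(x, lo, mid), _ramp_down(x, mid, hi)),  # triangle
        "High": _ramp_up(x, mid, hi),                                # right shoulder
    }


def band(x, lo, mid, hi):
    """Label with the largest weight; ties resolved in PRIORITY order."""
    weights = memberships(x, lo, mid, hi)
    best = max(weights.values())
    return next(name for name in PRIORITY if weights[name] == best)


lo, mid, hi = anchors([10.0, 20.0, 40.0, float("nan"), -5.0])
labels = {x: band(x, lo, mid, hi) for x in (10.0, 15.0, 20.0, 35.0, 40.0)}
```

With these anchors, a ratio of 15.0 has weight 0.5 in both Low and Medium, and the fixed priority order assigns it to Medium. The early returns in the ramp helpers also avoid a division by zero when two anchors coincide.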
A worked example provides a reproducible calculation of a band assignment from the reported anchors and ratios. Supplementary material includes a claim-to-evidence matrix, a reproducibility note, and a short glossary that links selected statements to code and workbook artefacts.
Published: 2026-03-05T07:47:02Z
Comment: 34 pages, 3 figures
Authors: Shane Lee, Stella Ng

http://arxiv.org/abs/2602.01712v2
Mapping a Decade of Avian Influenza Research (2014-2023): A Scientometric Analysis from Web of Science
Updated: 2026-03-05T07:09:02Z
This scientometric study analyzes Avian Influenza research from 2014 to 2023 using bibliographic data from the Web of Science database. We examined publication trends, sources, authorship, collaborative networks, document types, and geographical distribution to gain insights into the global research landscape. Results reveal a steady increase in publications, with high contributions from Chinese and American institutions. Journals such as PLoS One and the Journal of Virology published the highest number of studies, indicating their influence in this field. The most prolific institutions include the Chinese Academy of Sciences and the University of Hong Kong, while the College of Veterinary Medicine at South China Agricultural University emerged as the most productive department. China and the USA lead in publication volume, though developed nations like the United Kingdom and Germany exhibit a higher rate of international collaboration. "Articles" are the most common document type, constituting 84.6% of the total, while "Reviews" account for 7.6%.
This study provides a comprehensive view of global trends in Avian Influenza research, emphasizing the need for collaborative efforts across borders.
Published: 2026-02-02T06:37:20Z
Comment: 24 pages, 7 figures, Research Article
Journal reference: Journal of Health Information Research, 3(1), 1 - 24, 2026
Authors: Muneer Ahmad, Undie Felicia Nkatv, Amrita Sharma, Gorrety Maria Juma, Nicholas Kamoga, Julirine Nakanwagi
DOI: 10.47524/jhir.v3i1.25

http://arxiv.org/abs/2603.04799v1
Beyond Linear LLM Invocation: An Efficient and Effective Semantic Filter Paradigm
Updated: 2026-03-05T04:37:15Z
Large language models (LLMs) are increasingly used for semantic query processing over large corpora. A set of semantic operators derived from relational algebra has been proposed to provide a unified interface for expressing such queries, among which the semantic filter operator serves as a cornerstone. Given a table T with a natural language predicate e, for each tuple in the relation, the execution of a semantic filter proceeds by constructing an input prompt that combines the predicate e with its content, querying the LLM, and obtaining the binary decision. However, this tuple-by-tuple evaluation necessitates a complete linear scan of the table, incurring prohibitive latency and token costs. Although recent work has attempted to optimize semantic filtering, it still does not break the linear LLM invocation barriers. To address this, we propose Clustering-Sampling-Voting (CSV), a new framework that reduces LLM invocations to sublinear complexity while providing error guarantees. CSV embeds tuples into semantic clusters, samples a small subset for LLM evaluation, and infers cluster-level labels via two proposed voting strategies: UniVote, which aggregates labels uniformly, and SimVote, which weights votes by semantic similarity. Moreover, CSV triggers re-clustering on ambiguous clusters to ensure robustness across diverse datasets.
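The difference between uniform and similarity-weighted voting can be illustrated with a small sketch. It shows only the general idea of the two aggregation styles and is not the paper's UniVote or SimVote: the function names, the strict-majority threshold, and the meaning of the similarity scores (here, a hypothetical score per sampled tuple) are all assumptions.

```python
def uniform_vote(labels):
    """Strict majority over sampled LLM decisions, each counted equally."""
    return 2 * sum(labels) > len(labels)


def similarity_weighted_vote(labels, similarities):
    """Majority over sampled decisions, each weighted by a similarity score.

    `similarities` holds one non-negative, hypothetical score per sampled tuple.
    """
    if len(labels) != len(similarities):
        raise ValueError("labels and similarities must have the same length")
    positive = sum(s for label, s in zip(labels, similarities) if label)
    return 2 * positive > sum(similarities)


# Three sampled tuples from one cluster: one accepted, two rejected.
sampled_labels = [True, False, False]
sampled_sims = [0.9, 0.2, 0.3]

uniform_result = uniform_vote(sampled_labels)
weighted_result = similarity_weighted_vote(sampled_labels, sampled_sims)
```

In this example the uniform vote rejects the cluster, while the weighted vote accepts it because the single accepting sample carries most of the similarity mass. A cluster where the two disagree like this is the kind of case one might treat as ambiguous.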
Experiments conducted on real-world datasets demonstrate that CSV reduces the number of LLM calls by 1.28-355x compared to the state-of-the-art approaches, while maintaining comparable effectiveness in terms of Accuracy and F1 score.
Published: 2026-03-05T04:37:15Z
Authors: Nan Hou, Kangfei Zhao, Jiadong Xie, Jeffrey Xu Yu

http://arxiv.org/abs/2603.04785v1
Towards a B+-tree with Fluctuation-Free Performance
Updated: 2026-03-05T04:05:39Z
Performance predictability is critical for modern DBMSs because index maintenance can trigger rare but severe I/O spikes. In a B or B+-tree with height H, node split propagation means the cost of a single insert can vary from H + 1 to 3H + 1 I/Os when splits reach the root, nearly a threefold degradation. We formalize performance fluctuation as the gap between best- and worst-case insert behavior and introduce the notions of safe and critical nodes to capture when splits become unavoidable. We introduce the FFBtree, a B+-tree insert algorithm that preemptively splits some critical nodes, and prove that when navigating from root to leaf the insert algorithm will encounter at most one critical node that must be split, ensuring no split propagation can occur and producing fluctuation-free performance. Our implementation maintains critical-node metadata efficiently and integrates with optimistic lock coupling for concurrency. Experiments with simulated indexes show the FFBtree caps I/O fluctuation by eliminating split propagation and consistently reduces insert spikes compared to conventional baselines, and real-index experiments confirm comparable improvements.
Published: 2026-03-05T04:05:39Z
Authors: Lu Xing, Walid G. Aref

http://arxiv.org/abs/2603.04741v1
CONE: Embeddings for Complex Numerical Data Preserving Unit and Variable Semantics
Updated: 2026-03-05T02:26:36Z
Large pre-trained models (LMs) and Large Language Models (LLMs) are typically effective at capturing language semantics and contextual relationships.
However, these models encounter challenges in maintaining optimal performance on tasks involving numbers. Blindly treating numerical or structured data as terms is inadequate -- their semantics must be well understood and encoded by the models. In this paper, we propose CONE, a hybrid transformer encoder pre-trained model that encodes numbers, ranges, and Gaussians into an embedding vector space preserving distance. We introduce a novel composite embedding construction algorithm that integrates numerical values, ranges, or Gaussians together with their associated units and attribute names to precisely capture their intricate semantics. We conduct extensive experimental evaluation on large-scale datasets across diverse domains (web, medical, finance, and government) that demonstrates CONE's strong numerical reasoning capabilities, achieving an F1 score of 87.28% on DROP, a remarkable improvement of up to 9.37% in F1 over state-of-the-art (SOTA) baselines, and outperforming major SOTA models with a significant Recall@10 gain of up to 25%.
Published: 2026-03-05T02:26:36Z
Authors: Gyanendra Shrestha, Anna Pyayt, Michael Gubanov