https://arxiv.org/api/O7bFfl6wxB6jrywQvyjfwVFoS6E2026-03-20T14:33:14Z113694515http://arxiv.org/abs/2603.15099v1Completeness of Relational Algebra via Cylindric Algebra2026-03-16T10:50:45ZAn alternative proof of the completeness of relational algebra with respect to allowed formulas of first-order logic is presented. The proof relies on the well-known embedding of relational algebra into cylindric algebra, which makes it possible to establish completeness in a more algebraic way. Building on this proof, we present an alternative algorithm that produces a relational expression equivalent to a given allowed formula. The main motivation for the present work is to establish a proof of completeness suitable for generalisation to relational models handling incomplete or vague information.2026-03-16T10:50:45ZJan Laštovičkahttp://arxiv.org/abs/2403.03067v5MSO-Enumeration Over SLP-Compressed Unranked Forests2026-03-16T09:41:14ZWe study the problem of enumerating the answers to a query formulated in monadic second order logic (MSO) over an unranked forest F that is compressed by a straight-line program (SLP) D. Our main result states that this can be done after O(|D|) preprocessing and with output-linear delay (in data complexity). This is a substantial improvement over the previously known algorithms for MSO-evaluation over trees, since the compressed size |D| might be much smaller than (or even logarithmic in) the actual data size |F|, and there are linear time SLP-compressors that yield very good compressions on practical inputs. In particular, this also constitutes a meta-theorem in the field of algorithmics on SLP-compressed inputs: all enumeration problems on trees or strings that can be formulated in MSO-logic can be solved with linear preprocessing and output-linear delay, even if the inputs are compressed by SLPs. We also show that our approach can support vertex relabelling updates in time that is logarithmic in the uncompressed data. 
Our result extends previous work on the enumeration of MSO-queries over uncompressed trees and on the enumeration of document spanners over compressed text documents.2024-03-05T15:56:48Z64 pages. This is the TheoretiCS journal versionTheoretiCS, Volume 5 (February 26, 2026) theoretics:16426Markus LohreyMarkus L. Schmid10.46298/theoretics.26.6http://arxiv.org/abs/2603.14987v1Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI2026-03-16T08:51:33ZAs agentic AI systems move beyond static question answering into open-ended, tool-augmented, and multi-step real-world workflows, their increased authority poses greater risks of system misuse and operational failures. However, current evaluation practices remain fragmented, measuring isolated capabilities such as coding, hallucination, jailbreak resistance, or tool use in narrowly defined settings. We argue that the central limitation is not merely insufficient coverage of evaluation dimensions, but the lack of a principled notion of representativeness: an agent's trustworthiness should be assessed over a representative socio-technical scenario distribution rather than a collection of disconnected benchmark instances. To this end, we propose the Holographic Agent Assessment Framework (HAAF), a systematic evaluation paradigm that characterizes agent trustworthiness over a scenario manifold spanning task types, tool interfaces, interaction dynamics, social contexts, and risk levels. The framework integrates four complementary components: (i) static cognitive and policy analysis, (ii) interactive sandbox simulation, (iii) social-ethical alignment assessment, and (iv) a distribution-aware representative sampling engine that jointly optimizes coverage and risk sensitivity -- particularly for rare but high-consequence tail risks that conventional benchmarks systematically overlook. These components are connected through an iterative Trustworthy Optimization Factory. 
Through cycles of red-team probing and blue-team hardening, this paradigm progressively narrows the vulnerabilities to meet deployment standards, shifting agent evaluation from benchmark islands toward representative, real-world trustworthiness. Code and data for the illustrative instantiation are available at https://github.com/TonyQJH/haaf-pilot.2026-03-16T08:51:33Z6 pages, 1 figure. Submitted to KDD 2026 Blue Sky TrackJinhu QiYifan LiMinghao ZhaoWentao ZhangZijian ZhangYaoman LiIrwin Kinghttp://arxiv.org/abs/2603.12112v2Structure Selection for Fairness-Constrained Differentially Private Data Synthesis2026-03-16T03:42:01ZDifferential privacy (DP) enables safe data release, with synthetic data generation emerging as a common approach in recent years. Yet standard synthesizers preserve all dependencies in the data, including spurious correlations between sensitive attributes and outcomes. In fairness-critical settings, this reproduces unwanted bias. A principled remedy is to enforce conditional independence (CI) constraints, which encode domain knowledge or legal requirements that outcomes be independent of sensitive attributes once admissible factors are accounted for. DP synthesis typically proceeds in two phases: (i) a measurement step that privatizes selected marginals, often structured via maximum spanning trees (MSTs), and (ii) a reconstruction step that fits a probabilistic model consistent with the noisy marginals. We propose PrivCI, which enforces CI during the measurement step via a CI-aware greedy MST algorithm that integrates feasibility checks into Kruskal's construction under the exponential mechanism, improving accuracy over competing methods. 
Experiments on standard fairness benchmarks show that PrivCI achieves stronger fidelity and predictive accuracy than prior baselines while satisfying the specified CI constraints.2026-03-12T16:15:46Z8 pages, accepted to appear in an IEEE ICDE 2026 WorkshopNaeim GhahramanpourMostafa Milanihttp://arxiv.org/abs/2602.21566v2Epoch-based Optimistic Concurrency Control in Geo-replicated Databases2026-03-16T02:58:29ZGeo-distribution is essential for modern online applications to ensure service reliability and high availability. However, supporting high-performance serializable transactions in geo-replicated databases remains a significant challenge. This difficulty stems from the extensive over-coordination inherent in distributed atomic commitment, concurrency control, and fault-tolerance replication protocols under high network latency.
To address these challenges, we introduce Minerva, a unified distributed concurrency control designed for highly scalable multi-leader replication. Minerva employs a novel epoch-based asynchronous replication protocol that decouples data propagation from the commitment process, enabling continuous transaction replication. Optimistic concurrency control is used to allow any replica to execute transactions concurrently and commit without coordination. Instead of aborting transactions when conflicts are detected, Minerva uses deterministic re-execution to resolve conflicts, ensuring serializability without sacrificing performance. To further enhance concurrency, we construct a conflict graph and use a maximum weight independent set algorithm to select the optimal subset of transactions for commitment, minimizing the number of re-executed transactions. Our evaluation demonstrates that Minerva significantly outperforms state-of-the-art replicated databases, achieving over $3\times$ higher throughput in scalability experiments and $2.8\times$ higher throughput during a high network latency simulation with the TPC-C benchmark.2026-02-25T04:44:50ZYunhao MaoHarunari TakataMichail BachrasYuqiu ZhangShiquan ZhangGengrui ZhangHans-Arno Jacobsen10.1145/3802052http://arxiv.org/abs/2603.14754v1Towards Parameterized Hardness on Maintaining Conjunctive Queries2026-03-16T02:39:08ZWe investigate the fine-grained complexity of dynamically maintaining the result of fixed self-join free conjunctive queries under single-tuple updates. Prior work shows that free-connex queries can be maintained in update time $O(|D|^δ)$ for some $δ\in [0.5, 1]$, where $|D|$ is the size of the current database. However, a gap remains between the best known upper bound of $O(|D|)$ and lower bounds of $Ω(|D|^{0.5-ε})$ for any $ε\ge 0$.
We narrow this gap by introducing two structural parameters to quantify the dynamic complexity of a conjunctive query: the height $k$ and the dimension $d$. We establish new fine-grained lower bounds showing that any algorithm maintaining a query with these parameters must incur update time $Ω(|D|^{1-1/\max(k,d)-ε})$, unless widely believed conjectures fail. These yield the first super-$\sqrt{|D|}$ lower bounds for maintaining free-connex queries, and suggest the tightness of current algorithms when considering arbitrarily large $k$ and $d$.
Complementing our lower bounds, we identify a data-dependent parameter, the generalized $H$-index $h(D)$, which is upper bounded by $|D|^{1/d}$, and design an efficient algorithm for maintaining star queries, a common class of height 2 free-connex queries. The algorithm achieves an instance-specific update time $O(h(D)^{d-1})$ with linear space $O(|D|)$. This matches our parameterized lower bound and provides instance-specific performance in favorable cases.2026-03-16T02:39:08ZAccepted in PODS 2026Qichen Wanghttp://arxiv.org/abs/2509.23834v3GPM: The Gaussian Pancake Mechanism for Planting Undetectable Backdoors in Differential Privacy2026-03-15T22:15:36ZDifferential privacy (DP) has become the gold standard for preserving individual privacy in data analysis. However, an implicit yet fundamental assumption underlying these rigorous privacy guarantees is the correct implementation and execution of DP mechanisms. Several incidents of unintended privacy loss have occurred due to numerical issues and inappropriate configurations of DP software, which have been successfully exploited in privacy attacks. To better understand the seriousness of defective DP software, we ask the following question: is it possible to elevate these passive defects into active privacy attacks while maintaining covertness?
To address this question, we present the Gaussian pancake mechanism (GPM), a novel mechanism that is computationally indistinguishable from the widely used Gaussian mechanism (GM), yet exhibits arbitrarily weaker statistical DP guarantees. This unprecedented separation enables a new class of backdoor attacks: by indistinguishably passing off as the authentic GM, GPM can covertly degrade statistical privacy. Unlike the unintentional privacy loss caused by GM's numerical issues, GPM is an adversarial yet undetectable backdoor attack against data privacy. We formally prove GPM's covertness, characterize its statistical leakage, and demonstrate a concrete distinguishing attack that can achieve near-perfect success rates under suitable parameter choices, both theoretically and empirically.
Our results underscore the importance of using transparent, open-source DP libraries and highlight the need for rigorous scrutiny and formal verification of DP implementations to prevent subtle, undetectable privacy compromises in real-world systems.2025-09-28T12:14:06ZAccepted to ACM SIGMOD 2026. Please refer to https://github.com/jvhs0706/GPM for code and raw experiment logsHaochen SunXi Hehttp://arxiv.org/abs/2603.14492v1Oblivis: A Framework for Delegated and Efficient Oblivious Transfer2026-03-15T17:16:02ZAs database deployments shift toward cloud platforms and edge devices, thin clients need to securely retrieve sensitive records without leaking their query intent or metadata to the proxies that mediate access. Oblivious Transfer (OT) is a core tool for private retrieval, yet existing OTs assume direct client-database interaction and lack support for delegated querying or lightweight clients. We present Oblivis, a modular framework of new OT protocols that enable delegated, privacy-preserving query execution. Oblivis allows clients to retrieve database records without direct access, protects against leakage to both databases and proxies, and is designed with practical efficiency in mind. Its components include: (1) Delegated-Query OT, which permits secure outsourcing of query generation; (2) Multi-Receiver OT for merged, cloud-hosted databases; (3) a compiler producing constant-size responses suitable for thin clients; and (4) Supersonic OT, a proxy-based, information-theoretic, and highly efficient 1-out-of-2 OT. The protocols are formally defined and proven secure in the simulation-based paradigm, under a non-colluding assumption. We implement and empirically evaluate Supersonic OT. It achieves at least a 92x speedup over a highly efficient 1-out-of-2 OT, and a 2.6x-106x speedup over a standard OT extension across 200-100,000 invocations. 
Our implementation further shows that Supersonic OT remains efficient even on constrained hardware, e.g., it completes an end-to-end transfer in 1.36 ms on a Raspberry Pi 4.2026-03-15T17:16:02ZAydin AbadiYvo Desmedthttp://arxiv.org/abs/2603.14419v1Shape-Agnostic Table Overlap Discovery: A Maximum Common Subhypergraph Approach2026-03-15T15:03:23ZUnderstanding how two tables overlap is useful for many data management tasks, but challenging because tables often differ in row and column orders and lack reliable metadata in practice. Prior work defines the largest rectangular overlap, which identifies the maximal contiguous region of matching cells under row and column permutations. However, real overlaps are rarely rectangular: many valid matches may lie outside any single contiguous block. In this paper, we introduce the Shape-Agnostic Largest Table Overlap (SALTO), a novel generalized notion of overlap that captures arbitrary-shaped, non-contiguous overlaps between tables.
To tackle the combinatorial complexity of row and column permutations, we propose to model each table as a hypergraph, casting SALTO computation into a maximum common subhypergraph problem. We prove their equivalence and show the problem is NP-hard to approximate. To solve it, we propose HyperSplit, a novel branch-and-bound algorithm tailored to table-induced hypergraphs. HyperSplit introduces (i) hypergraph-aware label classes that jointly encode cell values and their row-column memberships to ensure structurally valid correspondences without explicit permutation enumeration, (ii) incidence-guided refinement and upper-bound pruning that leverage row-column connectivity to eliminate infeasible partial matches early, and (iii) a tolerance-based optimization mechanism with a tunable parameter that relaxes pruning by a bounded margin to accelerate convergence, enabling scalable yet accurate overlap discovery. Experiments on real-world datasets show that HyperSplit discovers overlaps more effectively (larger overlaps in up to 78.8% of the cases) and more efficiently than state of the art. Three case studies further demonstrate its practical impact across three tasks: cross-source copy detection, data deduplication, and version comparison.2026-03-15T15:03:23ZTechnical report of a paper accepted to SIGMOD 2026Ge LeeShixun HuangZhifeng BaoFelix NaumannShazia SadiqYanchang Zhaohttp://arxiv.org/abs/2602.08590v3SDFed: Bridging Local Global Discrepancy via Subspace Refinement and Divergence Control in Federated Prompt Learning2026-03-15T12:52:45ZVision-language pretrained models offer strong transferable representations, yet adapting them in privacy-sensitive multi-party settings is challenging due to the high communication cost of federated optimization and the limited local data on clients. Federated prompt learning mitigates this issue by keeping the VLPM backbone frozen and collaboratively training lightweight prompt parameters. 
However, existing approaches typically enforce a unified prompt structure and length across clients, which is inadequate under practical client heterogeneity in both data distributions and system resources, and may further introduce conflicts between globally shared and locally optimal knowledge. To address these challenges, we propose SDFed, a heterogeneous federated prompt learning framework that bridges Local-Global Discrepancy via Subspace Refinement and Divergence Control. SDFed maintains a fixed-length global prompt for efficient aggregation while allowing each client to learn a variable-length local prompt to better match its data characteristics and capacity. To mitigate local-global conflicts and facilitate effective knowledge transfer, SDFed introduces a subspace refinement method for local prompts and an information retention and divergence control strategy that preserves key local information while maintaining appropriate separability between global and local representations. Extensive experiments on several datasets demonstrate that SDFed consistently improves performance and robustness in heterogeneous federated settings.2026-02-09T12:33:00Z13 pages, 6 figuresYicheng DiWei YuanTieke HeYuan LiuHongzhi Yinhttp://arxiv.org/abs/2603.14339v1Causal Search for Skylines (CSS): Causally-Informed Selective Data De-Correlation2026-03-15T12:15:18ZSkyline queries are popular and effective tools in multi-criteria decision support as they extract interesting (Pareto-optimal) points that help summarize the available data with respect to a given set of preference attributes. Unfortunately, the efficiency of the skyline algorithms depends heavily on the underlying data statistics. 
In this paper, we argue that the efficiency of the skyline algorithms could be significantly boosted if one could erase any attribute correlations that do not agree with the preference criteria, while preserving (or even boosting) correlations that agree with the user-provided criteria. Therefore, we propose a causally-informed selective de-correlation mechanism to enable skyline algorithms to better leverage the pruning opportunities provided by the positively-aligned data distributions, without having to suffer from the mis-alignments. In particular, we show that, given a causal graph that describes the underlying causal structure of the data, one can identify a subset of the attributes that can be used to selectively de-correlate the preference attributes. Importantly, the proposed causal search for skylines (CSS) approach is agnostic to the underlying candidate enumeration and pruning strategies and, therefore, can be leveraged to improve any popular skyline discovery algorithm. Experiments on multiple real and synthetic data sets and for different skyline discovery algorithms show that the proposed causally-informed selective de-correlation technique significantly reduces both the number of dominance checks and the overall time needed to locate skyline points.2026-03-15T12:15:18ZSIGMOD 2026 (with extra appendix)Pratanu MandalAbhinav GorantlaK. Selçuk CandanMaria Luisa Sapinohttp://arxiv.org/abs/2503.05675v2Algorithmic Data Minimization for Machine Learning over Internet-of-Things Data Streams2026-03-15T06:55:37ZMachine learning can analyze vast amounts of data generated by IoT devices to identify patterns, make predictions, and enable real-time decision-making. By processing sensor data, machine learning models can optimize processes, improve efficiency, and enhance personalized user experiences in smart systems. 
However, IoT systems are often deployed in sensitive environments such as households and offices, where they may inadvertently expose identifiable information, including location, habits, and personal identifiers. This raises significant privacy concerns, necessitating the application of data minimization -- a foundational principle in emerging data regulations, which mandates that service providers only collect data that is directly relevant and necessary for a specified purpose. Despite its importance, data minimization lacks a precise technical definition in the context of sensor data, where collections of weak signals make it challenging to apply a binary "relevant and necessary" rule. This paper provides a technical interpretation of data minimization in the context of sensor streams, explores practical methods for implementation, and addresses the challenges involved. Through our approach, we demonstrate that our framework can reduce user identifiability by up to 16.7% while maintaining accuracy loss below 1%, offering a viable path toward privacy-preserving IoT data processing.2025-03-07T18:35:11Z10 pages, 9 figuresTed ShaowangShinan LiuJonatas MarquesNick FeamsterSanjay Krishnan10.14778/3773731.3773740http://arxiv.org/abs/2603.14242v1Wheel Dynamic Load Estimation Method Based on Gas Pressure of Hydro-pneumatic Suspension2026-03-15T06:30:19ZThis paper proposes a novel method to estimate the wheel dynamic load based on the gas pressure of a hydro-pneumatic suspension. A nonlinear coupled model between suspension chamber pressure and tire-ground contact force is developed, integrating suspension dynamics with its nonlinear stiffness characteristics. An iterative algorithm is developed to estimate wheel dynamic load using data from only one single pressure sensor, thereby eliminating the reliance on traditional tire models and complex multi-sensor fusion frameworks. 
This method effectively reduces hardware redundancy and minimizes the propagation of measurement errors. The proposed model is experimentally validated on a dedicated suspension test bench, demonstrating satisfactory agreement between the measured and estimated data. Additionally, co-simulation with TruckSim verifies the accuracy of both the calculated damping force and wheel dynamic load, demonstrating the effectiveness of the model in characterizing the mechanical behavior of the hydro-pneumatic suspension system. The proposed method provides a practical, low-cost, and efficient solution with minimal hardware dependencies.2026-03-15T06:30:19Z41 pages, 51 figuresQijun LiaoJue YangSubhash RakhejaYiting KangYumeng YaoYuming Yinhttp://arxiv.org/abs/2603.14190v1Sublime: Sublinear Error & Space for Unbounded Skewed Streams2026-03-15T02:57:17ZModern stream processing systems must often track the frequency of distinct keys in a data stream in real-time. Since monitoring the exact counts often entails a prohibitive memory footprint, many applications rely on compact, probabilistic data structures called frequency estimation sketches to approximate them. However, mainstream frequency estimation sketches fall short in two critical aspects: (1) They are memory-inefficient under data skew. This is because they use uniformly-sized counters to track the key counts and thus waste memory on storing the leading zeros of many small counter values. (2) Their estimation error deteriorates at least linearly with the stream's length, which may grow indefinitely over time. This is because they count the keys using a fixed number of counters.
We present Sublime, a framework that generalizes frequency estimation sketches to address these problems by dynamically adapting to the stream's skew and length. To save memory under skew, Sublime uses short counters upfront and elongates them with extensions stored within the same cache line as they overflow. It leverages novel bit manipulation routines to quickly access a counter's extension. It also controls the scaling of its error rate by expanding its number of approximate counters as the stream grows. We apply Sublime to Count-Min Sketch and Count Sketch. We show, theoretically and empirically, that Sublime significantly improves accuracy and memory over the state of the art while maintaining competitive or superior performance.2026-03-15T02:57:17Z27 pages. 16 figures. 3 tables. Accepted to SIGMOD 2026Navid EslamiIoana O. BerceaRasmus PaghNiv Dayanhttp://arxiv.org/abs/2512.11129v2Acyclic Conjunctive Regular Path Queries are no Harder than Corresponding Conjunctive Queries2026-03-14T23:53:47ZWe present an output-sensitive algorithm for evaluating an acyclic Conjunctive Regular Path Query (CRPQ). Its complexity is written in terms of the input size, the output size, and a well-known parameter of the query that is called the "free-connex fractional hypertree width". Our algorithm improves upon the complexity of the recently introduced output-sensitive algorithm for acyclic CRPQs. More notably, the complexity of our algorithm for a given acyclic CRPQ Q matches the best known output-sensitive complexity for the "corresponding" conjunctive query (CQ), that is the CQ that has the same structure as the CRPQ Q except that each RPQ is replaced with a binary atom (or a join of two binary atoms). This implies that it is not possible to improve upon our complexity for acyclic CRPQs without improving the state-of-the-art on output-sensitive evaluation for acyclic CQs. 
Our result is surprising because RPQs, and by extension CRPQs, are equivalent to recursive Datalog programs, which are generally poorly understood from a complexity standpoint. Yet, our result implies that the recursion aspect of acyclic CRPQs does not add any extra complexity on top of the corresponding (non-recursive) CQs, at least as far as output-sensitive analysis is concerned.2025-12-11T21:25:15ZMahmoud Abo KhamisAlexandru-Mihai HurjuiAhmet KaraDan OlteanuDan Suciu