https://arxiv.org/api/+jv1YE4okb1ZPGoDfiNu/oq3a1U 2026-06-18T19:42:08Z 29013 345 15 http://arxiv.org/abs/2605.21033v1 Efficient Banzhaf-Based Data Valuation for $k$-Nearest Neighbors Classification 2026-05-20T11:10:15Z

Data valuation, the task of quantifying the contribution of individual data points to model performance, has emerged as a fundamental challenge in machine learning. Game-theoretic approaches, such as the Banzhaf value, offer principled frameworks for fair data valuation; however, they suffer from exponential computational complexity. We address this challenge by developing efficient algorithms specifically tailored for computing Banzhaf values in $k$-nearest neighbor ($k$NN) classifiers. We first establish the theoretical hardness of the problem by proving that it is \#P-hard. Despite this intractability, we exploit the locality properties of $k$NN classifiers to develop practical exact algorithms. Our main contribution is a dynamic programming framework that achieves significant computational improvements: we present a pseudo-polynomial algorithm with $O(Wkn^2)$ time complexity for weighted $k$NN classifiers, where $W$ is the maximum sum of top-$k$ weights, and a specialized algorithm for unweighted $k$NN that achieves $O(nk^2)$ time complexity, that is, linear in the number of data points. We also offer efficient Monte Carlo estimation methods. Extensive experiments on real-world datasets demonstrate the practical efficiency of our approach and its effectiveness in data valuation applications.
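
As a point of reference for what the dynamic programs above speed up, the Banzhaf value of a training point $i$ is its average marginal contribution over all $2^{n-1}$ subsets of the other points. The sketch below computes it by brute force for a single test point. The utility function (fraction of the $k$ nearest selected points that share the test label) and the names `knn_utility` and `banzhaf_values` are assumptions for illustration; the abstract does not specify the utility, and this is the exponential-time baseline, not the paper's algorithm.

```python
from itertools import combinations


def knn_utility(subset, dists, labels, test_label, k):
    """Assumed utility: share of the k nearest points in `subset` with the test label."""
    if not subset:
        return 0.0
    nearest = sorted(subset, key=lambda i: dists[i])[:k]
    return sum(labels[i] == test_label for i in nearest) / k


def banzhaf_values(dists, labels, test_label, k):
    """Brute-force Banzhaf values: average marginal gain over all subsets of the others."""
    n = len(dists)
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for r in range(len(others) + 1):
            for s in combinations(others, r):
                with_i = knn_utility(s + (i,), dists, labels, test_label, k)
                without_i = knn_utility(s, dists, labels, test_label, k)
                total += with_i - without_i
        values.append(total / 2 ** (n - 1))
    return values


# Three training points at distances 1, 2, 3 from the test point (label 1).
print(banzhaf_values([1.0, 2.0, 3.0], [1, 0, 1], test_label=1, k=1))
# -> [0.75, -0.25, 0.25]
```

The mislabeled middle point receives a negative value, which is the kind of signal data valuation is meant to surface. The loop visits $2^{n-1}$ subsets per point, so it is only usable for very small $n$.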

2026-05-20T11:10:15Z To appear at VLDB 2026 Guangyi Zhang Lutz Oettershagen Lixu Wang Aristides Gionis http://arxiv.org/abs/2605.21015v1 Treewidth of the $n \times n$ toroidal grid 2026-05-20T10:50:16Z

In this paper, we show that the treewidth of the $n \times n$ toroidal grid is $2n-1$ for all $n \ge 5$. This closes the gap between the previously known upper bound of $2n-1$ (Ellis and Warren, DAM 2008) and the lower bound of $2n-2$ (Kiyomi, Okamoto, and Otachi, DAM 2016). To establish the matching lower bound, we construct a bramble of maximum order by utilizing maximum components obtained after removing $2n-1$ vertices. Our construction relies on the vertex-isoperimetric properties of the infinite grid to establish tight lower bounds on neighborhood sizes, combined with a careful analysis of balls of radius $n/2-1$ and their boundaries to overcome structural obstructions when $n$ is even.

2026-05-20T10:50:16Z 14 pages, 7 figures Tatsuya Gima Hiraku Morimoto Yuto Okada Yota Otachi http://arxiv.org/abs/2605.09464v2 The Impossibility of Simultaneous Time and I/O Optimality for The Planar Maxima and Convex Hull Problems 2026-05-20T08:53:42Z

We prove that no deterministic output-sensitive algorithm for the planar convex hull and maxima problems can obtain both optimal time and I/O complexity, where the optimality is defined with respect to both the input and output sizes. This explains why the best previous algorithms achieved an optimal I/O bound at the cost of sub-optimal running time (Goodrich et al. [FOCS, 1993]). To the best of our knowledge, the impossibility of simultaneous optimality was only shown previously for the permutation problem by Brodal and Fagerberg [STOC, 2003]. Our results imply that no optimal deterministic output-sensitive cache-oblivious algorithm exists for either problem. In addition, we present simple deterministic algorithms that match our lower bounds and that provide a trade-off between time and I/Os. On the other hand, a simple modification of our deterministic algorithm results in a randomized algorithm that simultaneously achieves optimal (worst-case) time and optimal expected I/O bounds.
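
For readers unfamiliar with the planar maxima problem: a point is maximal if no other point is at least as large in both coordinates. The sketch below is the textbook sort-and-scan method, shown only to pin down the problem. It is neither output-sensitive nor I/O-efficient, it is not one of the paper's algorithms, and the function name `planar_maxima` is hypothetical.

```python
def planar_maxima(points):
    """Return the maximal points, scanning from the largest x to the smallest."""
    best_y = float("-inf")
    maxima = []
    # Ties in x are broken by larger y first, so dominated ties are skipped.
    for x, y in sorted(points, key=lambda p: (-p[0], -p[1])):
        if y > best_y:
            maxima.append((x, y))
            best_y = y
    return maxima


print(planar_maxima([(1, 5), (2, 3), (3, 4), (4, 1), (2, 2)]))
# -> [(4, 1), (3, 4), (1, 5)]
```

The sort dominates the cost at $O(n \log n)$ time regardless of how many maxima there are, which is exactly what output-sensitive algorithms improve on.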

2026-05-10T10:31:45Z Full version of the ICALP 2026 conference paper Peyman Afshani Gerth Stølting Brodal Nodari Sitchinava http://arxiv.org/abs/2410.18915v4 Testing Support Size More Efficiently Than Learning Histograms 2026-05-20T08:53:09Z

Consider two problems about an unknown probability distribution $p$:
1. How many samples from $p$ are required to test if $p$ is supported on $n$ elements or not? Specifically, given samples from $p$, determine whether it is supported on at most $n$ elements, or it is "$ε$-far" (in total variation distance) from being supported on $n$ elements.
2. Given $m$ samples from $p$, what is the largest lower bound on its support size that we can produce?
The best known upper bound for problem (1) uses a general algorithm for learning the histogram of the distribution $p$, which requires $Θ(\tfrac{n}{ε^2 \log n})$ samples. We show that testing can be done more efficiently than learning the histogram, using only $O(\tfrac{n}{ε\log n} \log(1/ε))$ samples, nearly matching the best known lower bound of $Ω(\tfrac{n}{ε\log n})$. This algorithm also provides a better solution to problem (2), producing larger lower bounds on support size than what follows from previous work. The proof relies on an analysis of Chebyshev polynomial approximations outside the range where they are designed to be good approximations, and the paper is intended as an accessible self-contained exposition of the Chebyshev polynomial method.

2024-10-24T17:05:34Z 42 pages. This is the TheoretiCS journal version TheoretiCS, Volume 5 (May 21, 2026) theoretics:16717 Renato Ferreira Pinto Nathaniel Harms 10.46298/theoretics.26.10 http://arxiv.org/abs/2605.20897v1 Creating Robust and Fair Graph Structures for Connectivity and Clustering 2026-05-20T08:37:40Z

Graph algorithms are central to large-scale applications such as navigation systems, social networks, and data analysis platforms. This thesis studies two important challenges in such systems: robustness to failures and fairness in clustering outcomes. In the first part, we investigate fault-tolerant reachability preservers in directed graphs. We present the first non-trivial constructions of dual fault-tolerant pairwise reachability preservers that remain resilient to two edge or vertex failures, achieving a sparse construction of size $O(n^{4/3}|\mathcal{P}|^{1/3})$. In the second part, we study fair clustering algorithms that ensure balanced representation of protected groups. We develop approximation algorithms for fair consensus clustering and introduce the framework of closest fair clustering, establishing hardness results and efficient algorithms for multi-group settings. Building on this framework, we obtain improved guarantees for fair correlation clustering and design the first streaming algorithm for fair consensus clustering using only logarithmic memory. Together, these results contribute toward the design of graph algorithms that are both robust and socially responsible.

2026-05-20T08:37:40Z This work is a PhD Thesis Kushagra Chatterjee http://arxiv.org/abs/2605.20789v1 Circuits of Quantum Hashing and Quantum Fourier Transform for a Cactus as a Qubit Connectivity Graph 2026-05-20T06:35:05Z

We present a quantum circuit implementation of the quantum hashing algorithm (quantum fingerprinting) for a quantum device in which the application of two-qubit gates is restricted by a qubit connectivity graph. We present an optimization technique for the shallow circuit for quantum hashing in the case where the qubit connectivity graph is a cactus. The algorithm has $O(n^3)$ complexity to build the circuit, where $n$ is the number of qubits and $m$ is the number of connections (edges) in the graph. This is an improvement over the existing exponential-time algorithm for arbitrary graphs. The algorithm uses a solution to the shortest non-simple 1-covering path problem as a subroutine. We present an $O(n^3)$-time solution for this graph-theory problem in the case of a cactus, a result that may be of independent interest. The algorithm is also used to improve the quantum circuit for the Quantum Fourier Transform.

2026-05-20T06:35:05Z accepted by UCNC2026 Kamil Khadiev Ilnur Valeev http://arxiv.org/abs/2605.04465v2 Inverse Quadratic Decay in Random Subset Sum 2026-05-20T04:34:04Z

The Subset Sum Problem is a fundamental NP-complete problem in cryptography and combinatorial optimization, with many real-world applications. The Random Subset Sum Problem (RSSP) is a more applicable version of subset sum, where the numbers are drawn i.i.d. from some input distribution. We present an algorithm that, with probability $1-δ$, constructs the same $O(B/w)$ mesh as Da Cunha et al. (2023), while trimming to $w$ elements throughout and running in $O(w\log w)$ time. Then, using the mesh, we present a novel beam search heuristic that runs in linearithmic time with respect to list size $n$ and beam width $w$ and gives an expected error of $O\!\left(\frac{B}{nw^2}\right)$ under a standard mean-field assumption with equal standard deviation, demonstrating the practical effectiveness of meshing to achieve error decay. The algorithm is empirically robust to multiple input distributions and can naturally extend to variants with simple changes to the scoring heuristic, establishing a new practical baseline for robust subset sum error decay and $ε$-approximation theory.
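
To illustrate the general idea of keeping only $w$ partial sums while scanning the input, here is a minimal sketch of a trimmed subset-sum heuristic. It is a generic bucket-trimming scheme written for this note, not the mesh construction or beam search of the paper. The function name `trimmed_subset_sum`, the choice of one representative per bucket of width `target / w`, and the assumption of non-negative numbers with a positive target are all illustrative.

```python
def trimmed_subset_sum(numbers, target, w):
    """Largest subset sum <= target found while keeping at most w partial sums."""
    sums = [0]
    for x in numbers:
        candidates = set(sums)
        candidates.update(s + x for s in sums if s + x <= target)
        if len(candidates) > w:
            # Keep the largest partial sum in each of w equal-width buckets.
            buckets = {}
            for s in candidates:
                b = min(w - 1, int(s * w / target))
                buckets[b] = max(s, buckets.get(b, s))
            candidates = buckets.values()
        sums = sorted(candidates)
    return max(sums)


print(trimmed_subset_sum([3, 5, 7, 11], target=16, w=100))  # -> 16 (no trimming needed)
print(trimmed_subset_sum([3, 5, 7, 11], target=16, w=4))    # -> 15 (trimming loses 5 + 11)
```

With a small width the heuristic discards the partial sum 5 that the exact answer needs, which shows the error that a trimming rule trades for speed.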

2026-05-06T03:46:44Z Under Review at ACM TALG Edwin Chen Christof Teuscher http://arxiv.org/abs/2511.07846v2 Model-agnostic super-resolution in high dimensions 2026-05-20T02:36:27Z

The problem of super-resolution, roughly speaking, is to reconstruct an unknown signal to high accuracy, given (potentially noisy) information about its low-degree Fourier coefficients. Prior results on super-resolution have imposed strong modeling assumptions on the signal, typically requiring that it is a linear combination of spatially separated point sources. In this work we analyze a very general version of the super-resolution problem by considering completely general non-negative signals (equivalently, distributions) over the $d$-dimensional torus $[0,1)^d$; we do not assume any spatial separation between point sources, or even that the distribution is a finite linear combination of point sources. The question naturally arises: what can be said about super-resolution in such a general setting?
- As a warm-up, we first give a set of results for reconstructing distributions under the Wasserstein distance. We establish essentially matching upper and lower bounds on the cutoff frequency $T$ and the magnitude $κ$ of the noise for which accurate reconstruction is possible: we show that for $d$-dimensional distributions, estimates of $\approx \exp(d)$ many Fourier coefficients are both necessary and sufficient for accurate Wasserstein reconstruction.
- As our main result, we define a new notion of "heavy hitter" reconstruction for distributions, which essentially amounts to achieving high-accuracy reconstruction of all "sufficiently dense" regions of the distribution. We give essentially matching upper and lower bounds on the cutoff frequency $T$ and the magnitude $κ$ of the noise for which accurate reconstruction is possible under this notion. Our results show that (in sharp contrast with Wasserstein reconstruction) accurate estimates of only $\approx \exp(\sqrt{d})$ many Fourier coefficients are both necessary and sufficient for heavy hitter reconstruction.

2025-11-11T05:28:08Z Xi Chen Anindya De Yizhi Huang Shivam Nadimpalli Rocco A. Servedio Tianqi Yang http://arxiv.org/abs/2602.03436v3 The Complexity of Maximal/Closed Frequent Tree Mining for Bounded Height Trees 2026-05-20T01:38:56Z

Frequent tree mining asks us to enumerate tree patterns that occur frequently in a database of rooted trees. This problem is motivated by tree-structured data in bioinformatics, such as glycans and pseudoknot-free RNA secondary structures. A direct enumeration of all frequent trees is often highly redundant, because every subtree of a frequent tree is again frequent. Closed and maximal frequent trees are standard ways to reduce this redundancy, but their enumeration can still be computationally hard. In this paper, we study the effect of bounding the height of the input trees. This is a natural restriction for rooted trees, since the height is the depth of the hierarchy. We ask whether closed/maximal frequent tree mining remains hard when every input tree has a small height. Our results show that the answer depends sharply on the model. For rooted unordered trees of height at most 2, we give a polynomial-delay algorithm for enumerating closed frequent trees. On the other hand, for rooted ordered trees of height at most 2, we show that an output-polynomial time algorithm for enumerating closed frequent trees would imply an output-polynomial time algorithm for Dualization. For maximal frequent tree enumeration, we prove that no output-polynomial time algorithm exists unless P = NP already for rooted ordered trees of height at most 2 and for rooted unordered trees of height at most 3. Thus, even very small height bounds do not make the enumeration problems easy in general. At the same time, the unordered closed case of height at most 2 admits polynomial-delay enumeration. These results give a height-based classification of the complexity of closed and maximal frequent tree mining on shallow rooted trees.

2026-02-03T12:00:13Z Kenta Komoto Kazuhiro Kurita Hirotaka Ono http://arxiv.org/abs/2503.10972v2 A $(2+\varepsilon)$-Approximation Algorithm for Metric $k$-Median 2026-05-19T22:37:02Z

In the classical NP-hard metric $k$-median problem, we are given a set of $n$ clients and centers with metric distances between them, along with an integer parameter $k\geq 1$. The objective is to select a subset of $k$ open centers that minimizes the total distance from each client to its closest open center. In their seminal work, Jain, Mahdian, Markakis, Saberi, and Vazirani presented the Greedy algorithm for facility location, which implies a $2$-approximation algorithm for $k$-median that opens $k$ centers in expectation. Since then, substantial research has aimed at narrowing the gap between their algorithm and the best achievable approximation by an algorithm guaranteed to open exactly $k$ centers. During the last decade, all improvements have been achieved by leveraging their algorithm or a small improvement thereof, followed by a second step called bi-point rounding, which inherently increases the approximation guarantee. Our main result closes this gap: for any $ε>0$, we present a $(2+ε)$-approximation algorithm for $k$-median, improving the previous best-known approximation factor of $2.613$. Our approach builds on a combination of two algorithms. First, we present a non-trivial modification of the Greedy algorithm that operates with $O(\log n/ε^2)$ adaptive phases. Through a novel walk-between-solutions approach, this enables us to construct a $(2+ε)$-approximation algorithm for $k$-median that consistently opens at most $k + O(\log n/ε^2)$ centers. Second, we develop a novel $(2+ε)$-approximation algorithm tailored for stable instances, where removing any center from an optimal solution increases the cost by at least an $Ω(ε^3/\log n)$ fraction. Achieving this involves a sampling approach inspired by the $k$-means++ algorithm and a reduction to submodular optimization subject to a partition matroid.
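
The objective defined at the start of this abstract can be written down directly. The sketch below evaluates the $k$-median cost of a set of open centers and finds the optimum by trying every subset of size $k$, which is exponential in $k$ and serves only to make the objective concrete; it is not the approximation algorithm described above. The names `kmedian_cost` and `brute_force_kmedian` and the layout `dist[client][center]` are assumptions for illustration.

```python
from itertools import combinations


def kmedian_cost(dist, centers):
    """Total distance from each client to its closest open center."""
    return sum(min(row[c] for c in centers) for row in dist)


def brute_force_kmedian(dist, k):
    """Try every set of k centers; return (optimal cost, chosen centers)."""
    num_centers = len(dist[0])
    return min(
        (kmedian_cost(dist, centers), centers)
        for centers in combinations(range(num_centers), k)
    )


# Clients and candidate centers share the positions 0, 1, 10, 11 on a line.
positions = [0, 1, 10, 11]
dist = [[abs(p - q) for q in positions] for p in positions]
print(brute_force_kmedian(dist, k=2))  # -> (2, (0, 2))
```

An $α$-approximation algorithm guarantees a solution whose `kmedian_cost` is at most $α$ times the value returned by the exhaustive search.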

2025-03-14T00:36:26Z Vincent Cohen-Addad Fabrizio Grandoni Euiwoong Lee Chris Schwiegelshohn Ola Svensson http://arxiv.org/abs/2605.20526v1 An $O(n^5)$-Time Algorithm for Optimal Broadcast Domination 2026-05-19T21:54:58Z

Broadcast domination assigns a nonnegative integer power to every vertex of a graph so that every vertex is within the assigned power of some broadcasting vertex, and the objective is to minimize the sum of the powers. Heggernes and Lokshtanov proved that the problem is polynomial-time solvable on arbitrary connected unweighted graphs by showing that some optimal efficient broadcast has a domination graph that is a path or a cycle, and by reducing the general case to an $O(n^6)$-time algorithm. This paper gives an efficient algorithm for the path case. Instead of building one auxiliary acyclic graph for every possible left endpoint vertex, we build a single directed acyclic graph whose states are oriented broadcast balls together with their two possible residual sides. The resulting path-case algorithm runs in $O(n^3)$ time and $O(n^3)$ space on an $n$-vertex graph. Combining this routine with the same peel-one-ball reduction of Heggernes and Lokshtanov yields an exact $O(n^5)$-time algorithm for optimal broadcast domination on arbitrary connected unweighted graphs. This resolves the quintic-time conjecture for general graphs attributed to Heggernes and Sæther and recorded in subsequent surveys of broadcast domination.
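
The definition in the first sentence of this abstract can be checked mechanically. The following sketch verifies that a power assignment is a dominating broadcast and returns its cost, using a depth-limited breadth-first search from each broadcasting vertex. It only validates a given assignment and does not search for an optimal one; the function name `broadcast_cost` and the adjacency-dictionary input format are assumptions for illustration.

```python
from collections import deque


def broadcast_cost(adj, power):
    """Return the total power if `power` dominates every vertex, else None."""
    covered = set()
    for u, p in power.items():
        if p < 1:
            continue  # vertices with power 0 do not broadcast
        dist = {u: 0}
        queue = deque([u])
        while queue:
            v = queue.popleft()
            if dist[v] == p:
                continue  # do not expand beyond the broadcast radius
            for nb in adj[v]:
                if nb not in dist:
                    dist[nb] = dist[v] + 1
                    queue.append(nb)
        covered.update(dist)
    return sum(power.values()) if covered == set(adj) else None


# Path on five vertices: 0 - 1 - 2 - 3 - 4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(broadcast_cost(path, {2: 2}))        # -> 2
print(broadcast_cost(path, {1: 1, 3: 1}))  # -> 2
print(broadcast_cost(path, {0: 1}))        # -> None
```

Both valid assignments on the path have cost 2, showing that one strong broadcast and several weak ones can be equally cheap.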

2026-05-19T21:54:58Z Kleitos Papadopoulos http://arxiv.org/abs/2605.18623v2 An Approximation Algorithm for Graph Label Selection 2026-05-19T20:21:52Z

In the graph label selection problem, one is given an $n$-vertex graph and a budget $k$, and seeks to select $k$ vertices whose labels enable accurate prediction of the labels on the remaining vertices. This problem formalizes distilling a small representative set from the whole graph. We present the first $\tilde{O}(\log^{1.5} n)$-approximation algorithm for graph label selection under the standard budget constraint. Prior work either relies on resource augmentation, allowing substantially more than $k$ labeled vertices, or consists primarily of heuristics without provable guarantees. Finally, we demonstrate that practical heuristic variants of our algorithm scale to significantly larger graphs than previous methods, while essentially retaining their quality.

2026-05-18T16:32:40Z Accepted at ICML 2026. 9 pages, 7 figures Josia John Simon Meierhans Maximilian Probst Gutenberg http://arxiv.org/abs/2510.27588v3 Learned Static Function Data Structures 2026-05-19T17:47:18Z

We consider the task of constructing a data structure for associating a static set of keys with values, while allowing arbitrary output values for queries involving keys outside the set. Compared to hash tables, these so-called static function data structures do not need to store the key set and thus use significantly less memory. Several techniques are known, with compressed static functions approaching the zero-order empirical entropy of the value sequence. In this paper, we introduce learned static functions, which use machine learning to capture correlations between keys and values. For each key, a model predicts a probability distribution over the values, from which we derive a key-specific prefix code to compactly encode the true value. The resulting codeword is stored in a classic static function data structure. This design allows learned static functions to break the zero-order entropy barrier while still supporting point queries. Our experiments show substantial space savings: up to one order of magnitude on real data, and up to three orders of magnitude on synthetic data.
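
The space argument above can be made concrete with ideal code lengths. A prefix code can give a value of predicted probability $q$ a codeword of $\lceil -\log_2 q \rceil$ bits (Shannon code lengths satisfy the Kraft inequality), so a good per-key prediction shortens the stored codeword. The sketch compares that total with the zero-order empirical entropy of the value sequence. The toy predictor `predict` and all function names are hypothetical, and the code only counts bits; it does not build the static function data structure.

```python
import math
from collections import Counter


def zero_order_bits(values):
    """n times the zero-order empirical entropy of the value sequence, in bits."""
    n = len(values)
    return -sum(c * math.log2(c / n) for c in Counter(values).values())


def learned_code_bits(keys, values, predict):
    """Total Shannon code length when each key gets its own predicted distribution."""
    return sum(math.ceil(-math.log2(predict(k)[v])) for k, v in zip(keys, values))


def predict(key):
    """Toy stand-in for a model: probability 1/2 on the true value, 1/6 on the rest."""
    true_value = key % 4
    return {v: (0.5 if v == true_value else 1 / 6) for v in range(4)}


keys = list(range(8))
values = [k % 4 for k in keys]
print(zero_order_bits(values))                   # -> 16.0
print(learned_code_bits(keys, values, predict))  # -> 8
```

The four values are equally frequent, so a key-independent code needs 2 bits per value, while the key-aware code needs 1 bit per value in this toy setting.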

2025-10-31T16:09:53Z PVLDB, 19(5): 917-930, 2026 Stefan Hermann Hans-Peter Lehmann Giorgio Vinciguerra Stefan Walzer 10.14778/3796195.3796205 http://arxiv.org/abs/2605.20070v1 Optimizing for Fairness in Generalized Kidney Exchange: Theory and Computations 2026-05-19T16:23:51Z

The seminal work of Roth, Sönmez, & Ünver shows that the Edmonds-Gallai structure theorem for non-bipartite matching can be leveraged to yield a randomized algorithm to match patient-donor pairs in kidney exchange with extraordinarily strong properties. This breakthrough led to randomized polynomial-time algorithms to find a maximum-cardinality matching maximizing individual fairness objectives--measured by the probability that nodes are matched--such as Nash social welfare. But the exchanges allowed in practice go beyond cardinality matching, generalizing to weighted variants and allowing structures such as paths and 3-cycles. We show that strongly polynomial algorithms guaranteeing the same fairness properties can be obtained in weighted settings for matching and 2-paths. While even maximum cardinality coverage with cycles and paths of length at least three is NP-hard, we provide a general result showing that any optimization subroutine (for whichever structure is allowed) can be bootstrapped using a polynomial number of calls to yield a mechanism that has analogous fairness properties to those obtained for matching. We complement these theoretical results with computational results, both on well-studied synthetic data-sets and on samples drawn from real data, that demonstrate the striking advantages of adding fairness considerations to more general kidney-exchange mechanisms.

2026-05-19T16:23:51Z published in IOS 2026 Claire Chang Arin Khare David Shmoys http://arxiv.org/abs/2404.16676v2 Multilayer Correlation Clustering 2026-05-19T15:51:54Z

We establish Multilayer Correlation Clustering, a novel generalization of Correlation Clustering to the multilayer setting. In this model, we are given a series of inputs of Correlation Clustering (called layers) over the common set $V$ of $n$ elements. The goal is to find a clustering of $V$ that minimizes the $\ell_p$-norm ($p\geq 1$) of the multilayer-disagreements vector, which is defined as the vector (with dimension equal to the number of layers), each element of which represents the disagreements of the clustering on the corresponding layer. For this generalization, we first design an $O(L\log n)$-approximation algorithm, where $L$ is the number of layers. We then study an important special case of our problem, namely the problem with the so-called probability constraint. For this case, we first give an $(α+2)$-approximation algorithm, where $α$ is any possible approximation ratio for the single-layer counterpart. Furthermore, we design a $4$-approximation algorithm, which improves the above approximation ratio of $α+2=4.5$ for the general probability-constraint case. Computational experiments using real-world datasets support our theoretical findings and demonstrate the practical effectiveness of our proposed algorithms.
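
The objective described above can be evaluated with a few lines of code. In this sketch each layer is given as the set of pairs labeled similar, with every other pair treated as dissimilar; a pair counts as a disagreement when the clustering separates a similar pair or joins a dissimilar one. The unweighted setting, the input format, and the function names are assumptions for illustration, and the code only scores a given clustering rather than optimizing over clusterings.

```python
from itertools import combinations


def layer_disagreements(elements, similar_pairs, cluster_of):
    """Number of pairs on which the clustering contradicts one layer."""
    count = 0
    for u, v in combinations(elements, 2):
        same_cluster = cluster_of[u] == cluster_of[v]
        similar = (u, v) in similar_pairs or (v, u) in similar_pairs
        if same_cluster != similar:
            count += 1
    return count


def multilayer_objective(elements, layers, cluster_of, p):
    """Return the l_p norm of the disagreements vector and the vector itself."""
    vector = [layer_disagreements(elements, layer, cluster_of) for layer in layers]
    return sum(d ** p for d in vector) ** (1 / p), vector


elements = [0, 1, 2]
layers = [{(0, 1)}, {(0, 1), (1, 2), (0, 2)}]
clustering = {0: "a", 1: "a", 2: "b"}
print(multilayer_objective(elements, layers, clustering, p=2))  # -> (2.0, [0, 2])
```

Small $p$ rewards low total disagreement across layers, while large $p$ pushes the objective toward the worst layer.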

2024-04-25T15:25:30Z AISTATS 2026 Atsushi Miyauchi Florian Adriaens Francesco Bonchi Nikolaj Tatti