http://arxiv.org/abs/2605.21033v1
Efficient Banzhaf-Based Data Valuation for $k$-Nearest Neighbors Classification
2026-05-20T11:10:15Z
Data valuation, the task of quantifying the contribution of individual data points to model performance, has emerged as a fundamental challenge in machine learning. Game-theoretic approaches, such as the Banzhaf value, offer principled frameworks for fair data valuation; however, they suffer from exponential computational complexity. We address this challenge by developing efficient algorithms specifically tailored for computing Banzhaf values in $k$-nearest neighbor ($k$NN) classifiers. We first establish the theoretical hardness of the problem by proving that it is \#P-hard. Despite this intractability, we exploit the locality properties of $k$NN classifiers to develop practical exact algorithms. Our main contribution is a dynamic programming framework that achieves significant computational improvements: we present a pseudo-polynomial algorithm with $O(Wkn^2)$ time complexity for weighted $k$NN classifiers, where $W$ is the maximum sum of top-$k$ weights, and a specialized algorithm for unweighted $k$NN that achieves $O(nk^2)$ time complexity, that is, linear in the number of data points. We also offer efficient Monte Carlo estimation methods. Extensive experiments on real-world datasets demonstrate the practical efficiency of our approach and its effectiveness in data valuation applications.
2026-05-20T11:10:15Z
To appear at VLDB 2026
Guangyi Zhang, Lutz Oettershagen, Lixu Wang, Aristides Gionis

http://arxiv.org/abs/2605.21015v1
Treewidth of the $n \times n$ toroidal grid
2026-05-20T10:50:16Z
In this paper, we show that the treewidth of the $n \times n$ toroidal grid is $2n-1$ for all $n \ge 5$. This closes the gap between the previously known upper bound of $2n-1$ (Ellis and Warren, DAM 2008) and the lower bound of $2n-2$ (Kiyomi, Okamoto, and Otachi, DAM 2016).
To establish the matching lower bound, we construct a bramble of maximum order by utilizing maximum components obtained after removing $2n-1$ vertices. Our construction relies on the vertex-isoperimetric properties of the infinite grid to establish tight lower bounds on neighborhood sizes, combined with a careful analysis of balls of radius $n/2-1$ and their boundaries to overcome structural obstructions when $n$ is even.
2026-05-20T10:50:16Z
14 pages, 7 figures
Tatsuya Gima, Hiraku Morimoto, Yuto Okada, Yota Otachi

http://arxiv.org/abs/2605.09464v2
The Impossibility of Simultaneous Time and I/O Optimality for The Planar Maxima and Convex Hull Problems
2026-05-20T08:53:42Z
We prove that no deterministic output-sensitive algorithm for the planar convex hull and maxima problems can obtain both optimal time and I/O complexity, where the optimality is defined with respect to both the input and output sizes. This explains why the best previous algorithms achieved an optimal I/O bound at the cost of sub-optimal running time (Goodrich et al. [FOCS, 1993]). To the best of our knowledge, the impossibility of simultaneous optimality was only shown previously for the permutation problem by Brodal and Fagerberg [STOC, 2003]. Our results imply that no optimal deterministic output-sensitive cache-oblivious algorithm exists for either problem. In addition, we present simple deterministic algorithms that match our lower bounds and that provide a trade-off between time and I/Os. On the other hand, a simple modification of our deterministic algorithm results in a randomized algorithm that simultaneously achieves optimal (worst-case) time and optimal expected I/O bounds.
2026-05-10T10:31:45Z
Full version of the ICALP 2026 conference paper
Peyman Afshani, Gerth Stølting Brodal, Nodari Sitchinava

http://arxiv.org/abs/2410.18915v4
Testing Support Size More Efficiently Than Learning Histograms
2026-05-20T08:53:09Z
Consider two problems about an unknown probability distribution $p$:
1. How many samples from $p$ are required to test if $p$ is supported on $n$ elements or not? Specifically, given samples from $p$, determine whether it is supported on at most $n$ elements, or it is "$ε$-far" (in total variation distance) from being supported on $n$ elements.
2. Given $m$ samples from $p$, what is the largest lower bound on its support size that we can produce?
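As a point of comparison for problem (2), and not the Chebyshev-polynomial method of the paper, the number of distinct elements among the $m$ samples is always a valid (if weak) lower bound on the support size, because every observed element has positive probability under $p$. A minimal sketch of that baseline, where the function name `naive_support_lower_bound` is hypothetical:

```python
from typing import Hashable, Iterable


def naive_support_lower_bound(samples: Iterable[Hashable]) -> int:
    """Return the number of distinct sampled elements.

    Baseline only (an assumption for illustration, not the paper's
    estimator): each observed element lies in the support of p, so this
    count can never exceed the true support size.
    """
    return len(set(samples))
```

Any method addressing problem (2) is interesting only insofar as it certifies a larger bound than this count from the same samples.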
The best known upper bound for problem (1) uses a general algorithm for learning the histogram of the distribution $p$, which requires $Θ(\tfrac{n}{ε^2 \log n})$ samples. We show that testing can be done more efficiently than learning the histogram, using only $O(\tfrac{n}{ε\log n} \log(1/ε))$ samples, nearly matching the best known lower bound of $Ω(\tfrac{n}{ε\log n})$. This algorithm also provides a better solution to problem (2), producing larger lower bounds on support size than what follows from previous work. The proof relies on an analysis of Chebyshev polynomial approximations outside the range where they are designed to be good approximations, and the paper is intended as an accessible self-contained exposition of the Chebyshev polynomial method.
2024-10-24T17:05:34Z
42 pages. This is the TheoretiCS journal version
TheoretiCS, Volume 5 (May 21, 2026) theoretics:16717
Renato Ferreira Pinto, Nathaniel Harms
10.46298/theoretics.26.10

http://arxiv.org/abs/2605.20897v1
Creating Robust and Fair Graph Structures for Connectivity and Clustering
2026-05-20T08:37:40Z
Graph algorithms are central to large-scale applications such as navigation systems, social networks, and data analysis platforms. This thesis studies two important challenges in such systems: robustness to failures and fairness in clustering outcomes. In the first part, we investigate fault-tolerant reachability preservers in directed graphs. We present the first non-trivial constructions of dual fault-tolerant pairwise reachability preservers that remain resilient to two edge or vertex failures, achieving a sparse construction of size $O(n^{4/3}|\mathcal{P}|^{1/3})$.
In the second part, we study fair clustering algorithms that ensure balanced representation of protected groups. We develop approximation algorithms for fair consensus clustering and introduce the framework of closest fair clustering, establishing hardness results and efficient algorithms for multi-group settings. Building on this framework, we obtain improved guarantees for fair correlation clustering and design the first streaming algorithm for fair consensus clustering using only logarithmic memory. Together, these results contribute toward the design of graph algorithms that are both robust and socially responsible.
2026-05-20T08:37:40Z
This work is a PhD Thesis
Kushagra Chatterjee

http://arxiv.org/abs/2605.20789v1
Circuits of Quantum Hashing and Quantum Fourier Transform for a Cactus as a Qubit Connectivity Graph
2026-05-20T06:35:05Z
We present a quantum circuit implementation of the quantum hashing algorithm (quantum fingerprinting) for a quantum device in which the application of two-qubit gates is restricted by a qubit connectivity graph. We present an optimization technique for the shallow circuit for quantum hashing in the case of a cactus as the qubit connectivity graph. The algorithm has $O(n^3)$ complexity to build the circuit, where $n$ is the number of qubits and $m$ is the number of connections (edges) in the graph. This is an improvement over the existing exponential-time algorithm for arbitrary graphs. The algorithm uses a solution to the shortest non-simple 1-covering path problem as a subroutine. We present an $O(n^3)$-time solution for this graph-theory problem in the case of a cactus. This result may be of independent interest.
The algorithm is also used to improve the quantum circuit for the Quantum Fourier Transform.
2026-05-20T06:35:05Z
accepted by UCNC2026
Kamil Khadiev, Ilnur Valeev

http://arxiv.org/abs/2605.04465v2
Inverse Quadratic Decay in Random Subset Sum
2026-05-20T04:34:04Z
The Subset Sum Problem is a fundamental NP-complete problem in cryptography and combinatorial optimization, with many real-world applications. The Random Subset Sum Problem (RSSP) is a more applicable version of subset sum, where numbers are drawn i.i.d. from some input distribution. We present an algorithm that, with probability $1-δ$, constructs the same $O(B/w)$ mesh as Da Cunha et al. (2023), while trimming to $w$ elements throughout and running in $O(w\log w)$ time. Then, we present a novel beam search heuristic that uses the mesh and runs in linearithmic time with respect to list size $n$ and beam width $w$. It gives an expected error of $O\!\left(\frac{B}{nw^2}\right)$ under a standard mean-field assumption with equal standard deviation, demonstrating the practical effectiveness of meshing to achieve error decay. The algorithm is empirically robust to multiple input distributions and can naturally extend to variants with simple changes to the scoring heuristic, establishing a new practical baseline for robust subset sum error decay and $ε$-approximation theory.
2026-05-06T03:46:44Z
Under Review at ACM TALG
Edwin Chen, Christof Teuscher

http://arxiv.org/abs/2511.07846v2
Model-agnostic super-resolution in high dimensions
2026-05-20T02:36:27Z
The problem of super-resolution, roughly speaking, is to reconstruct an unknown signal to high accuracy, given (potentially noisy) information about its low-degree Fourier coefficients. Prior results on super-resolution have imposed strong modeling assumptions on the signal, typically requiring that it is a linear combination of spatially separated point sources.
In this work we analyze a very general version of the super-resolution problem by considering completely general non-negative signals (equivalently, distributions) over the $d$-dimensional torus $[0,1)^d$; we do not assume any spatial separation between point sources, or even that the distribution is a finite linear combination of point sources. The question naturally arises: what can be said about super-resolution in such a general setting?
- As a warm-up, we first give a set of results for reconstructing distributions under the Wasserstein distance. We establish essentially matching upper and lower bounds on the cutoff frequency $T$ and the magnitude $κ$ of the noise for which accurate reconstruction is possible: we show that for $d$-dimensional distributions, estimates of $\approx \exp(d)$ many Fourier coefficients are both necessary and sufficient for accurate Wasserstein reconstruction.
- As our main result, we define a new notion of "heavy hitter" reconstruction for distributions, which essentially amounts to achieving high-accuracy reconstruction of all "sufficiently dense" regions of the distribution. We give essentially matching upper and lower bounds on the cutoff frequency $T$ and the magnitude $κ$ of the noise for which accurate reconstruction is possible under this notion. Our results show that (in sharp contrast with Wasserstein reconstruction) accurate estimates of only $\approx \exp(\sqrt{d})$ many Fourier coefficients are both necessary and sufficient for heavy hitter reconstruction.
2025-11-11T05:28:08Z
Xi Chen, Anindya De, Yizhi Huang, Shivam Nadimpalli, Rocco A. Servedio, Tianqi Yang

http://arxiv.org/abs/2602.03436v3
The Complexity of Maximal/Closed Frequent Tree Mining for Bounded Height Trees
2026-05-20T01:38:56Z
Frequent tree mining asks us to enumerate tree patterns that occur frequently in a database of rooted trees. This problem is motivated by tree-structured data in bioinformatics, such as glycans and pseudoknot-free RNA secondary structures. A direct enumeration of all frequent trees is often highly redundant, because every subtree of a frequent tree is again frequent. Closed and maximal frequent trees are standard ways to reduce this redundancy, but their enumeration can still be computationally hard.
In this paper, we study the effect of bounding the height of the input trees. This is a natural restriction for rooted trees, since the height is the depth of the hierarchy. We ask whether closed/maximal frequent tree mining remains hard when every input tree has a small height. Our results show that the answer depends sharply on the model. For rooted unordered trees of height at most 2, we give a polynomial-delay algorithm for enumerating closed frequent trees. On the other hand, for rooted ordered trees of height at most 2, we show that an output-polynomial time algorithm for enumerating closed frequent trees would imply an output-polynomial time algorithm for Dualization. For maximal frequent tree enumeration, we prove that no output-polynomial time algorithm exists unless P = NP already for rooted ordered trees of height at most 2 and for rooted unordered trees of height at most 3.
Thus, even very small height bounds do not make the enumeration problems easy in general. At the same time, the unordered closed case of height at most 2 admits polynomial-delay enumeration. These results give a height-based classification of the complexity of closed and maximal frequent tree mining on shallow rooted trees.
2026-02-03T12:00:13Z
Kenta Komoto, Kazuhiro Kurita, Hirotaka Ono

http://arxiv.org/abs/2503.10972v2
A $(2+\varepsilon)$-Approximation Algorithm for Metric $k$-Median
2026-05-19T22:37:02Z
In the classical NP-hard metric $k$-median problem, we are given a set of $n$ clients and centers with metric distances between them, along with an integer parameter $k\geq 1$. The objective is to select a subset of $k$ open centers that minimizes the total distance from each client to its closest open center.
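The $k$-median objective just stated can be made concrete with a small sketch. The code below evaluates the cost of a set of open centers and finds an optimum by exhaustive search; it runs in exponential time and only illustrates the objective, not the approximation algorithm of the paper. The names `kmedian_cost` and `brute_force_kmedian` are hypothetical, and `dist` is assumed to be any metric supplied by the caller.

```python
from itertools import combinations
from typing import Callable, Sequence, Tuple, TypeVar

P = TypeVar("P")


def kmedian_cost(clients: Sequence[P], open_centers: Sequence[P],
                 dist: Callable[[P, P], float]) -> float:
    """Total distance from each client to its closest open center."""
    return sum(min(dist(c, f) for f in open_centers) for c in clients)


def brute_force_kmedian(clients: Sequence[P], centers: Sequence[P], k: int,
                        dist: Callable[[P, P], float]) -> Tuple[Tuple[P, ...], float]:
    """Try every size-k subset of centers (illustration only, exponential time)."""
    best = min(combinations(centers, k),
               key=lambda chosen: kmedian_cost(clients, chosen, dist))
    return best, kmedian_cost(clients, best, dist)


# Usage: four points on a line, two centers to open.
points = [0, 1, 10, 11]
chosen, cost = brute_force_kmedian(points, points, 2, lambda a, b: abs(a - b))
```

On this instance one center is opened in each of the two tight groups, so each group contributes a distance of 1 to the total.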
In their seminal work, Jain, Mahdian, Markakis, Saberi, and Vazirani presented the Greedy algorithm for facility location, which implies a $2$-approximation algorithm for $k$-median that opens $k$ centers in expectation. Since then, substantial research has aimed at narrowing the gap between their algorithm and the best achievable approximation by an algorithm guaranteed to open exactly $k$ centers. During the last decade, all improvements have been achieved by leveraging their algorithm or a small improvement thereof, followed by a second step called bi-point rounding, which inherently increases the approximation guarantee.
Our main result closes this gap: for any $ε>0$, we present a $(2+ε)$-approximation algorithm for $k$-median, improving the previous best-known approximation factor of $2.613$. Our approach builds on a combination of two algorithms. First, we present a non-trivial modification of the Greedy algorithm that operates with $O(\log n/ε^2)$ adaptive phases. Through a novel walk-between-solutions approach, this enables us to construct a $(2+ε)$-approximation algorithm for $k$-median that consistently opens at most $k + O(\log n/ε^2)$ centers. Second, we develop a novel $(2+ε)$-approximation algorithm tailored for stable instances, where removing any center from an optimal solution increases the cost by at least an $Ω(ε^3/\log n)$ fraction. Achieving this involves a sampling approach inspired by the $k$-means++ algorithm and a reduction to submodular optimization subject to a partition matroid.
2025-03-14T00:36:26Z
Vincent Cohen-Addad, Fabrizio Grandoni, Euiwoong Lee, Chris Schwiegelshohn, Ola Svensson

http://arxiv.org/abs/2605.20526v1
An $O(n^5)$-Time Algorithm for Optimal Broadcast Domination
2026-05-19T21:54:58Z
Broadcast domination assigns a nonnegative integer power to every vertex of a graph so that every vertex is within the assigned power of some broadcasting vertex, and the objective is to minimize the sum of the powers. Heggernes and Lokshtanov proved that the problem is polynomial-time solvable on arbitrary connected unweighted graphs by showing that some optimal efficient broadcast has a domination graph that is a path or a cycle, and by reducing the general case to an $O(n^6)$-time algorithm. This paper gives an efficient algorithm for the path case. Instead of building one auxiliary acyclic graph for every possible left endpoint vertex, we build a single directed acyclic graph whose states are oriented broadcast balls together with their two possible residual sides. The resulting path-case algorithm runs in $O(n^3)$ time and $O(n^3)$ space on an $n$-vertex graph.
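To make the definition of broadcast domination concrete, the following sketch checks whether a given power assignment dominates an unweighted graph and computes its cost. It is a feasibility checker only, not the path-case or $O(n^5)$ algorithm described in the abstract. The adjacency-list representation and the names `ball`, `is_dominating_broadcast`, and `broadcast_cost` are assumptions made for illustration.

```python
from collections import deque
from typing import Dict, Hashable, List, Set

Graph = Dict[Hashable, List[Hashable]]  # adjacency lists, undirected


def ball(graph: Graph, source: Hashable, radius: int) -> Set[Hashable]:
    """Vertices within `radius` hops of `source`, found by truncated BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == radius:
            continue  # do not expand beyond the broadcast range
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)


def is_dominating_broadcast(graph: Graph, power: Dict[Hashable, int]) -> bool:
    """True if every vertex is within power[u] of some vertex u with power[u] > 0.

    Vertices missing from `power` are treated as having power 0.
    """
    covered: Set[Hashable] = set()
    for u, p in power.items():
        if p > 0:
            covered |= ball(graph, u, p)
    return covered == set(graph)


def broadcast_cost(power: Dict[Hashable, int]) -> int:
    """Objective value: the sum of all assigned powers."""
    return sum(power.values())


# Usage: on the path 0-1-2-3-4, power 2 at the middle vertex reaches everything.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
```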
Combining this routine with the same peel-one-ball reduction of Heggernes and Lokshtanov yields an exact $O(n^5)$-time algorithm for optimal broadcast domination on arbitrary connected unweighted graphs. This resolves the quintic-time conjecture for general graphs attributed to Heggernes and Sæther and recorded in subsequent surveys of broadcast domination.
2026-05-19T21:54:58Z
Kleitos Papadopoulos

http://arxiv.org/abs/2605.18623v2
An Approximation Algorithm for Graph Label Selection
2026-05-19T20:21:52Z
In the graph label selection problem, one is given an $n$-vertex graph and a budget $k$, and seeks to select $k$ vertices whose labels enable accurate prediction of the labels on the remaining vertices. This problem formalizes distilling a small representative set from the whole graph. We present the first $\tilde{O}(\log^{1.5} n)$-approximation algorithm for graph label selection under the standard budget constraint. Prior work either relies on resource augmentation, allowing substantially more than $k$ labeled vertices, or consists primarily of heuristics without provable guarantees. Finally, we demonstrate that practical heuristic variants of our algorithm scale to significantly larger graphs than previous methods, while essentially retaining their quality.
2026-05-18T16:32:40Z
Accepted at ICML 2026. 9 pages, 7 figures
Josia John, Simon Meierhans, Maximilian Probst Gutenberg

http://arxiv.org/abs/2510.27588v3
Learned Static Function Data Structures
2026-05-19T17:47:18Z
We consider the task of constructing a data structure for associating a static set of keys with values, while allowing arbitrary output values for queries involving keys outside the set. Compared to hash tables, these so-called static function data structures do not need to store the key set and thus use significantly less memory. Several techniques are known, with compressed static functions approaching the zero-order empirical entropy of the value sequence.
In this paper, we introduce learned static functions, which use machine learning to capture correlations between keys and values. For each key, a model predicts a probability distribution over the values, from which we derive a key-specific prefix code to compactly encode the true value. The resulting codeword is stored in a classic static function data structure. This design allows learned static functions to break the zero-order entropy barrier while still supporting point queries. Our experiments show substantial space savings: up to one order of magnitude on real data, and up to three orders of magnitude on synthetic data.
2025-10-31T16:09:53Z
PVLDB, 19(5): 917-930, 2026
Stefan Hermann, Hans-Peter Lehmann, Giorgio Vinciguerra, Stefan Walzer
10.14778/3796195.3796205

http://arxiv.org/abs/2605.20070v1
Optimizing for Fairness in Generalized Kidney Exchange: Theory and Computations
2026-05-19T16:23:51Z
The seminal work of Roth, Sönmez, & Ünver shows that the Edmonds-Gallai structure theorem for non-bipartite matching can be leveraged to yield a randomized algorithm to match patient-donor pairs in kidney exchange with extraordinarily strong properties. This breakthrough led to randomized polynomial-time algorithms to find a maximum-cardinality matching maximizing individual fairness objectives--measured by the probability that nodes are matched--such as Nash social welfare. But the exchanges allowed in practice go beyond cardinality matching, generalizing to weighted variants and allowing structures such as paths and 3-cycles. We show that strongly polynomial algorithms guaranteeing the same fairness properties can be obtained in weighted settings for matching and 2-paths.
While even maximum cardinality coverage with cycles and paths of length at least three is NP-hard, we provide a general result showing that any optimization subroutine (for whichever structure is allowed) can be bootstrapped using a polynomial number of calls to yield a mechanism that has analogous fairness properties to those obtained for matching. We complement these theoretical results with computational results, both on well-studied synthetic data-sets and on samples drawn from real data, that demonstrate the striking advantages of adding fairness considerations to more general kidney-exchange mechanisms.
2026-05-19T16:23:51Z
published in IOS 2026
Claire Chang, Arin Khare, David Shmoys

http://arxiv.org/abs/2404.16676v2
Multilayer Correlation Clustering
2026-05-19T15:51:54Z
We establish Multilayer Correlation Clustering, a novel generalization of Correlation Clustering to the multilayer setting. In this model, we are given a series of inputs of Correlation Clustering (called layers) over the common set $V$ of $n$ elements. The goal is to find a clustering of $V$ that minimizes the $\ell_p$-norm ($p\geq 1$) of the multilayer-disagreements vector, which is defined as the vector (with dimension equal to the number of layers), each element of which represents the disagreements of the clustering on the corresponding layer. For this generalization, we first design an $O(L\log n)$-approximation algorithm, where $L$ is the number of layers. We then study an important special case of our problem, namely the problem with the so-called probability constraint. For this case, we first give an $(α+2)$-approximation algorithm, where $α$ is any possible approximation ratio for the single-layer counterpart. Furthermore, we design a $4$-approximation algorithm, which improves the above approximation ratio of $α+2=4.5$ for the general probability-constraint case.
Computational experiments using real-world datasets support our theoretical findings and demonstrate the practical effectiveness of our proposed algorithms.
2024-04-25T15:25:30Z
AISTATS 2026
Atsushi Miyauchi, Florian Adriaens, Francesco Bonchi, Nikolaj Tatti
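
The objective in the Multilayer Correlation Clustering abstract above can be illustrated with a short sketch that evaluates the $\ell_p$-norm of the multilayer-disagreements vector for a fixed clustering. It only evaluates the objective and is not one of the approximation algorithms of the paper. It assumes the simplest unweighted setting, in which each layer labels pairs as similar (+1) or dissimilar (-1); the type alias `Layer` and the function names are hypothetical.

```python
from typing import Dict, Hashable, List, Tuple

# Assumption: a layer maps each labeled pair to +1 (similar) or -1 (dissimilar).
Layer = Dict[Tuple[Hashable, Hashable], int]


def layer_disagreements(layer: Layer, cluster_of: Dict[Hashable, int]) -> int:
    """Count similar pairs that are split plus dissimilar pairs that are merged."""
    count = 0
    for (u, v), sign in layer.items():
        same = cluster_of[u] == cluster_of[v]
        if (sign > 0 and not same) or (sign < 0 and same):
            count += 1
    return count


def multilayer_objective(layers: List[Layer], cluster_of: Dict[Hashable, int],
                         p: float) -> float:
    """l_p-norm (p >= 1) of the vector of per-layer disagreement counts."""
    vector = [layer_disagreements(layer, cluster_of) for layer in layers]
    return sum(d ** p for d in vector) ** (1.0 / p)


# Usage: two layers over V = {a, b, c} that disagree on every pair.
layer1 = {("a", "b"): 1, ("a", "c"): -1, ("b", "c"): -1}
layer2 = {("a", "b"): -1, ("a", "c"): 1, ("b", "c"): 1}
one_cluster = {"a": 0, "b": 0, "c": 0}
value = multilayer_objective([layer1, layer2], one_cluster, p=2)
```

Larger values of $p$ put more weight on the worst layer, so the choice of $p$ trades total disagreement against balance across layers.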