TC-MIS: Maximal Independent Set on Tensor-cores

2026-05-28T08:40:56Z

Computing a Maximal Independent Set (MIS) of a graph is a fundamental problem with applications in resource allocation, scheduling, and network optimization. Although graphs are inherently unstructured and challenging for GPU parallelism due to irregular memory access and workload imbalance, specialized GPU algorithms have achieved good performance, processing million-vertex graphs in milliseconds. Modern GPUs are equipped with Tensor Cores (TCs), specialized units for matrix operations with 8-16x higher throughput than CUDA Cores (CCs), which are extensively used for ML, DL, and inference tasks but remain largely unexplored for graph algorithms. In this paper, we present TC-MIS, a TC-accelerated algorithm that reformulates key phases of MIS computation as sparse matrix-vector multiplication (SpMV). TC-MIS tiles the graph adjacency matrix and employs Warp Matrix Multiply-Accumulate (WMMA) operations to transform irregular graph traversal into regular, massively parallel computation. Our evaluation across TC-enabled microarchitectures (Ampere, Ada Lovelace, Hopper, Blackwell) demonstrates that TC-MIS achieves an average speedup of 2.84x on RTX A5000, 4.84x on L40S, 18.80x on H200, and 5.20x on RTX 5080, with a maximum speedup of 44.38x on H200, over state-of-the-art methods, while maintaining solution quality comparable to that obtained by established heuristics that produce near-maximum independent sets.
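To make the SpMV reformulation concrete, here is a minimal CPU-side sketch of an MIS round phrased as two matrix-vector-style passes over the adjacency structure. It assumes a Luby-style random-priority scheme, which the abstract does not confirm as the scheme used by TC-MIS, and it uses plain adjacency lists in place of the tiled adjacency matrix and WMMA operations. The names `neighbor_max` and `luby_mis` are hypothetical.

```python
import random

def neighbor_max(adj, x):
    """y[v] = max of x[u] over neighbours u of v (0.0 if v has none).

    This is a matrix-vector product over the adjacency matrix in which
    the usual sum is replaced by max.
    """
    return [max((x[u] for u in nbrs), default=0.0) for nbrs in adj]

def luby_mis(adj, seed=0):
    """Luby-style maximal independent set (assumed scheme, see lead-in)."""
    rng = random.Random(seed)
    n = len(adj)
    active = [True] * n
    in_mis = [False] * n
    while any(active):
        # Active vertices draw a priority in (0, 1]; inactive ones get 0.
        prio = [1.0 - rng.random() if active[v] else 0.0 for v in range(n)]
        # Pass 1: a vertex wins if it strictly beats every neighbour.
        best = neighbor_max(adj, prio)
        winner = [active[v] and prio[v] > best[v] for v in range(n)]
        # Pass 2: propagate the winner indicator to find covered vertices.
        covered = neighbor_max(adj, [1.0 if w else 0.0 for w in winner])
        for v in range(n):
            if winner[v]:
                in_mis[v] = True
                active[v] = False
            elif covered[v] > 0.0:
                active[v] = False
    return [v for v in range(n) if in_mis[v]]
```

Both passes touch every row of the adjacency structure in the same regular pattern, which is the property that makes a tiled matrix formulation plausible on matrix hardware.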

Sampling Directed Eulerian Tours in $\widetilde O(m^{3/2})$ Time

2026-05-28T08:17:10Z

We give a randomized algorithm that samples a nearly uniform Eulerian tour of a directed Eulerian multigraph with $m$ arcs in $\widetilde O(m^{3/2})$ time. The guarantee is worst-case, applies to arbitrary directed Eulerian multigraphs, and breaks the $mn$-type arborescence-sampling barrier on sparse graphs. The core case is a $2$-in/$2$-out graph. We introduce a new local Markov chain, the flip--repair walk: one step locally splits a tour into two circuits and then chooses uniformly among the local flips that repair the state to one tour. We prove that this walk mixes in a nearly linear number of steps and implement the walk using a dynamic chord data structure. A pointwise degree-reduction wrapper extends the sampler from this degree-two core to arbitrary degrees while preserving the $\widetilde O(m^{3/2})$ total running time. The high-level algorithmic plan, the switching-network reduction, and the dynamic data-structure argument were devised by the author. The author conjectured the mixing theorem underlying the analysis, and GPT 5.5 Pro Extended produced its linear-algebra proof. Codex assisted with manuscript assembly and typesetting.

Explaining Rankings with Hidden Group Bonuses

2026-05-28T06:36:11Z

Determining a linear utility function that correlates with observed candidate rankings is a foundational problem with applications in domains such as admissions, hiring, and recommendation systems, e.g., [Storandt and Funke, AAAI'19, Zhang et al., KDD'23, Wang et al., ICDE'24 (best paper award), Chen and Wong, VLDB'24]. Traditionally, these models assume full visibility into the feature sets used to determine the utility score. However, real-world scenarios often involve sensitive attributes that are hidden or partially observed, yet may influence outcomes through additive bonuses designed to promote fairness, as in [Gale and Marian, ICDE'24]. Motivated by such practical concerns, we study a variant of the ranking explanation problem where sensitive features are unobserved but may influence candidate rankings through group-specific linear boosts. We present a formal framework for modeling this problem and develop an algorithmic solution that leverages constraint satisfaction and automated reasoning techniques to jointly infer the linear scoring parameters and latent group bonuses consistent with the observed rankings. We further show that determining a satisfying linear function with group-specific bonuses is \textsf{NP}-hard in general, but when the feature dimension and the number of groups are constant, the problem admits a polynomial-time solution. Our approach is the first to address this nuanced variant, which captures key real-world challenges in fair ranking and admission systems. We perform extensive experiments on both real-world and synthetic datasets, demonstrating that our method effectively recovers hidden bonus structures and provides faithful explanations of observed ranking outcomes.
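As a rough illustration of the model, the sketch below checks whether a candidate weight vector and set of group bonuses reproduce an observed ranking. It assumes the score has the form $w \cdot x_i + b_{g(i)}$ with strict ordering between consecutive candidates; the exact formulation in the paper may differ, and the function name `explains_ranking` and the toy data are hypothetical.

```python
def explains_ranking(features, groups, ranking, w, bonus):
    """Return True if score(i) = w . features[i] + bonus[groups[i]]
    is strictly decreasing along `ranking` (best candidate first)."""
    def score(i):
        return sum(wi * xi for wi, xi in zip(w, features[i])) + bonus[groups[i]]
    scores = [score(i) for i in ranking]
    return all(a > b for a, b in zip(scores, scores[1:]))

# Toy data (made up): candidate 1 outranks candidate 0 although their
# visible features give the same linear score under w = [1, 1].
features = [[3, 1], [2, 2], [1, 0]]
groups = ["a", "b", "a"]
ranking = [1, 0, 2]

without_bonus = explains_ranking(features, groups, ranking, [1, 1], {"a": 0.0, "b": 0.0})
with_bonus = explains_ranking(features, groups, ranking, [1, 1], {"a": 0.0, "b": 0.5})
```

Here the ranking is unexplainable with these weights unless group "b" receives a positive bonus, which is the kind of hidden structure the inference problem tries to recover.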

Distributed Gaussian Mean Testing under Communication Constraints: messages, samples, and coins

2026-05-28T06:20:43Z

We revisit the problem of Gaussian mean testing in a distributed, communication-constrained setting, where each of $n$ users independently observes samples from an unknown $d$-dimensional spherical Gaussian distribution $\mathcal{G}(μ,\mathbb{I}_d)$, and can communicate up to $\ell$ bits to a central referee. The referee's goal is then to distinguish between the cases (i) $\|μ\|_2 = 0$ and (ii) $\|μ\|_2\ge \varepsilon$. This problem has been considered in the private- and public-coin settings, when each user holds exactly one sample, or more generally when each holds exactly $m$ samples. In this work, we significantly generalize the question in three directions: when the users only share a small number $s$ of random bits, when each user holds a different number of samples $m_k$, and when each user can send a different number of bits $\ell_k$ to the referee.

An Improved Greedy Approximation for (Metric) $k$-Means

2026-05-27T23:05:27Z

Clustering is a basic task in data analysis and machine learning, and optimizing clustering objectives is a well-studied class of optimization problems; amongst these, the $k$-Means objective is arguably the most well known. Given a collection of points in a metric space, the goal is to partition them into $k$ clusters, each with an associated center, so as to minimize the sum of squared distances of points to their cluster centers. In this paper, we present a polynomial-time $3+2\sqrt{2}+ε<5.83$-approximation algorithm for $k$-Means in general metrics. This substantially improves on the current-best $(9+ε)$-approximation in [Ahmadian, Norouzi-Fard, Svensson, Ward - FOCS'17, SICOMP'20], and even slightly improves on the $5.92$-approximation in [Cohen-Addad, Esfandiari, Mirrokni, Narayanan - STOC'22] for the Euclidean special case. A natural approach for $k$-Means is to leverage Lagrangian Multiplier Preserving (LMP) approximations for the facility location problem. The previous best results for $k$-Means build upon an adaptation of an LMP $3$-approximation for facility location with metric connection costs in [Jain, Vazirani - J.ACM'01] based on a primal-dual method, rather than on the improved LMP greedy $2$-approximation for the same problem in [Jain, Mahdian, Markakis, Saberi, Vazirani - J.ACM'03]. The barrier to using the improved LMP algorithm was that no adaptation of this algorithm and its analysis to the case of squared metric connection costs was known (since squared distances violate the triangle inequality). Our main contribution is overcoming this barrier by providing such an adaptation. This new LMP approximation algorithm is then combined with the framework recently introduced in [Cohen-Addad, Grandoni, Lee, Schwiegelshohn, Svensson - STOC'25] for the related (metric) $k$-Median problem.
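A small worked example, ours rather than the paper's, shows why squared distances fall outside the metric setting: for three collinear points the squared distance violates the triangle inequality, although a relaxed version with factor $2$ still holds.

```latex
% Points x = 0, y = 1, z = 2 on the real line, with d(a,b) = |a - b|.
d(x,z)^2 = 4 \;>\; 2 = d(x,y)^2 + d(y,z)^2 ,
\qquad\text{but}\qquad
d(x,z)^2 \le \bigl(d(x,y) + d(y,z)\bigr)^2 \le 2\,d(x,y)^2 + 2\,d(y,z)^2 .
```

The last step uses $(a+b)^2 \le 2a^2 + 2b^2$, so analyses written for metric connection costs lose constant factors when applied to squared costs without adaptation.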

Residual-Entropy Accounting for Routed Atom-Budgeted Learned Indexes

2026-05-27T20:08:51Z

We study exact predecessor and rank search in a routed, atom-budgeted, certified-repair learned-index architecture. An ordered directory routes each query to a contiguous interval, a counted local predictor returns a certified rank window, and exact repair resolves the remaining uncertainty by comparisons. The result is scoped to this architecture and does not claim guarantees for arbitrary learned-index designs such as unconstrained RMI dispatch, hash routing, neural routing, or exact-payload indexes without additional accounting. The main parameter is conditional residual answer entropy: the entropy of the exact answer after the leaf, predictor output, certificate, and charged pre-repair information are observed. We prove a two-sided accounting theorem showing that this functional gives the query-time scale under the stated architecture and local predictor-atom budget. Directory space, sorted-array storage, and transcript-indexed repair-program space are treated as separate system costs, so the theorem is not a byte-level space lower bound or a compact implementation recipe. We also give a rank-spread specialization in which the radius term $\log(1 + \Delta)$ is valid only when many residual ranks remain likely after the predictor transcript is known. For counted piecewise-linear segments, we make the profile term non-oracular, derive a shadow-price allocation rule, compute finite-instance RGapM and GapM values on real SOSD and Zenodo samples, and report benchmarks against PGM-index, RadixSpline, and binary search. The benchmarks expose overheads and bottlenecks rather than claiming speed for the shadow prototype.
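The predict-then-repair pattern described above can be sketched as follows. This is a single-segment toy under our own assumptions, not the paper's construction: the predictor is a straight line through the first and last key, the certificate is its maximum observed rank error, and there is no routing directory. The names `build` and `rank` are hypothetical.

```python
import bisect
import math

def build(keys):
    """Fit a line through the endpoints of a sorted key list and record
    a certified error radius `eps` covering every stored key."""
    n = len(keys)
    span = keys[-1] - keys[0]
    slope = (n - 1) / span if span > 0 else 0.0
    err = max(abs(slope * (k - keys[0]) - i) for i, k in enumerate(keys))
    # One extra slot of slack absorbs rounding and between-key queries.
    return {"keys": keys, "slope": slope, "base": keys[0], "eps": math.ceil(err) + 1}

def rank(index, q):
    """Return (number of keys < q, width of the window that was searched)."""
    keys = index["keys"]
    n = len(keys)
    p = index["slope"] * (q - index["base"])          # predicted rank
    lo = min(n, max(0, math.floor(p) - index["eps"]))
    hi = max(lo, min(n, math.ceil(p) + index["eps"] + 1))
    # Exact repair: comparisons only inside the certified window [lo, hi].
    return bisect.bisect_left(keys, q, lo, hi), hi - lo
```

The repair cost is logarithmic in the window width rather than in the number of keys, which is the quantity that a term of the form $\log(1 + \Delta)$ is meant to capture when $\Delta$ is the window radius.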

Rapid GPU-Based Pangenome Graph Layout

2026-05-27T20:07:32Z

Computational Pangenomics is an emerging field that studies genetic variation using a graph structure encompassing multiple genomes. Visualizing pangenome graphs is vital for understanding genome diversity. Yet, handling large graphs can be challenging due to the high computational demands of the graph layout process. In this work, we conduct a thorough performance characterization of a state-of-the-art pangenome graph layout algorithm, revealing significant data-level parallelism, which makes GPUs a promising option for compute acceleration. However, irregular data access and the algorithm's memory-bound nature present significant hurdles. To overcome these challenges, we develop a solution implementing three key optimizations: a cache-friendly data layout, coalesced random states, and warp merging. Additionally, we propose a quantitative metric for scalable evaluation of pangenome layout quality. Evaluated on 24 human whole-chromosome pangenomes, our GPU-based solution achieves a 57.3x speedup over the state-of-the-art multithreaded CPU baseline without layout quality loss, reducing execution time from hours to minutes.

Conjectured Bounds for 2-Local Hamiltonians via Token Graphs

2026-05-27T19:08:09Z

We explain how the maximum energies of the Quantum MaxCut, XY, and EPR Hamiltonians on a graph $G$ are related to the spectral radii of the token graphs of $G$. From a numerical study, we conjecture new bounds for these spectral radii based on properties of $G$. We show how these conjectures tighten the analysis of existing algorithms, implying state-of-the-art approximation ratios for all three Hamiltonians. Our conjectures also provide simple combinatorial bounds on the ground state energy of the antiferromagnetic Heisenberg model, which we prove for bipartite graphs.
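For readers unfamiliar with token graphs, the sketch below builds the $k$-token graph of a small graph using the standard definition: vertices are the $k$-subsets of $V(G)$, and two subsets are adjacent when one is obtained from the other by sliding a single token along an edge of $G$. The function name `token_graph` and the vertex labelling $0,\dots,n-1$ are our assumptions.

```python
from itertools import combinations

def token_graph(n, edges, k):
    """Adjacency sets of the k-token graph of a graph on vertices 0..n-1."""
    edge_set = {frozenset(e) for e in edges}
    nodes = [frozenset(c) for c in combinations(range(n), k)]
    adj = {s: set() for s in nodes}
    for s in nodes:
        for a in s:                      # token to move
            for b in range(n):           # free target vertex
                if b not in s and frozenset((a, b)) in edge_set:
                    adj[s].add((s - {a}) | {b})
    return adj
```

Each edge of $G$ gives rise to $\binom{n-2}{k-1}$ edges of the token graph, one for every placement of the remaining $k-1$ tokens, so the construction grows quickly with $k$ and is only practical for small instances.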

A Fresh Look at Lamarckian Evolution and the Baldwin Effect

2026-05-27T16:30:39Z

Baldwinian and Lamarckian evolution have existed for a long time in evolutionary algorithms (EAs) without ever dominating the academic literature or practical applications. In this work, we use modern empirical and theoretical methods to revisit Lamarckian and Baldwinian evolution and rigorously compare them with generic Darwinian evolution. On the empirical side, we run a comprehensive suite of experiments on graphs from six different datasets from the recent GraphBench benchmark on the Maximum Independent Set and Maximum Cut problems. Our results show that Baldwinian and Lamarckian evolution consistently outperform Darwinian evolution, confirming the great potential of local-search-augmented evolutionary algorithms. Notably, in the great majority of cases, all EAs outperform recent deep learning baselines and approach the performance of highly specialised heuristic and exact solvers. We furthermore report a high-performing set of generalist parameters for all studied evolution types that we hope will be of use to practitioners in future. On the theoretical side, we extend the existing Deceptive Leading Block benchmark to arbitrary block length and use tools from modern theoretical runtime analysis to prove upper and lower bounds on the expected runtime. For block lengths greater than two, Baldwinian evolution is asymptotically faster than Lamarckian evolution, which in turn is asymptotically faster than Darwinian evolution. When accounting for the cost of the local search procedure in fitness evaluations, the ordering depends on the implementation, with Baldwinian evolution staying fastest from small block lengths onwards, explaining its strong empirical performance.
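The difference between the three evolution types lies in how local search feeds back into an individual, which the following sketch makes explicit using the usual textbook definitions. The bit-flip hill climber and the names `local_search` and `evaluate` are illustrative assumptions, not the setup used in the paper's experiments.

```python
def local_search(x, fitness):
    """Single-bit-flip hill climbing until no flip improves the fitness."""
    x = list(x)
    improved = True
    while improved:
        improved = False
        for i in range(len(x)):
            y = x[:]
            y[i] ^= 1
            if fitness(y) > fitness(x):
                x, improved = y, True
    return x

def evaluate(x, fitness, mode):
    """Return (genotype kept in the population, fitness used for selection)."""
    if mode == "darwinian":
        return list(x), fitness(x)          # no local search at all
    y = local_search(x, fitness)
    if mode == "lamarckian":
        return y, fitness(y)                # improvement is written back
    if mode == "baldwinian":
        return list(x), fitness(y)          # improvement only affects selection
    raise ValueError(f"unknown mode: {mode}")
```

With `fitness=sum` on the bit string `[0, 1, 0]`, the Darwinian call keeps the genotype with fitness 1, the Lamarckian call replaces it by `[1, 1, 1]` with fitness 3, and the Baldwinian call keeps `[0, 1, 0]` but reports fitness 3.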

Disjunctive Sum of Squares

2026-05-27T16:08:30Z

We introduce the concept of disjunctive sum of squares for certifying nonnegativity of polynomials. Unlike the popular sum of squares approach where nonnegativity is certified by a single algebraic identity, the disjunctive sum of squares approach certifies nonnegativity with multiple algebraic identities which can be found in parallel. Our main result is a disjunctive Positivstellensatz proving that we can keep the degree of each algebraic identity as low as the degree of the polynomial whose nonnegativity is in question. Based on this result, we construct a semidefinite programming based converging hierarchy of lower bounds for the problem of minimizing a polynomial over a compact basic semialgebraic set, where the size of the largest semidefinite constraint is fixed throughout the hierarchy. We further prove a second disjunctive Positivstellensatz which leads to an optimization-free hierarchy for polynomial optimization. We specialize this result to the problem of proving copositivity of matrices. Finally, we describe how the disjunctive sum of squares approach can be combined with a branch-and-bound algorithm and we present numerical experiments on polynomial, copositive, and combinatorial optimization problems.

Solving Positive Linear Programs with Differential Privacy

2026-05-27T16:02:05Z

We study differentially private approximation algorithms for positive linear programs (LPs with nonnegative coefficients and variables), focusing on the fundamental families of packing, covering, and mixed packing-covering formulations. We focus on the high-sensitivity, constraint-private regime of Hsu-Roth-Roughgarden-Ullman (ICALP 2014), where neighboring instances may differ by an arbitrary single constraint, so one cannot hope to approximately satisfy every constraint under privacy. We give private solvers that return approximate solutions while violating only a controlled number of constraints. Our algorithms improve the prior instance-dependent guarantees, and also yield new data-independent bounds that depend only on the dimension. Our techniques involve a dense multiplicative weights update method developed from a regularized dual viewpoint, which we analyze in a way that exploits structure specific to positive LPs.

Parameterized Spanning Tree Congestion

2026-05-27T12:39:02Z

In this paper we study the Spanning Tree Congestion problem, where we are given a graph $G=(V,E)$ and are asked to find a spanning tree $T$ of minimum maximum congestion. Here, the congestion of an edge $e\in T$ is the number of edges $uv\in E$ such that the (unique) path from $u$ to $v$ in $T$ traverses $e$. We consider this well-studied NP-hard problem from the point of view of (structural) parameterized complexity and obtain the following results. We resolve a natural open problem by showing that Spanning Tree Congestion is not FPT parameterized by treewidth (under standard assumptions). More strongly, we present a generic reduction which applies to (almost) any parameter of the form ``vertex-deletion distance to class $\mathcal{C}$'', thus obtaining W[1]-hardness for parameters more restricted than treewidth, including tree-depth plus feedback vertex set, or incomparable to treewidth, such as twin cover. Via a slight tweak of the same reduction we also show that the problem is NP-complete on interval graphs of modular-width $4$. Even though it is known that Spanning Tree Congestion remains NP-hard on instances with only one vertex of unbounded degree, it is currently open whether the problem remains hard on bounded-degree graphs. We resolve this question by showing NP-hardness on graphs of maximum degree 8. Complementing the problem's W[1]-hardness for treewidth...
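The congestion definition above can be evaluated directly for a given spanning tree, as in this minimal sketch. It assumes vertices are labelled $0,\dots,n-1$ and that `tree_edges` really forms a spanning tree of the graph; the function name `tree_congestion` is hypothetical, and the routine only scores a tree rather than searching for an optimal one.

```python
from collections import deque

def tree_congestion(n, graph_edges, tree_edges):
    """Maximum congestion over the edges of a given spanning tree."""
    adj = [[] for _ in range(n)]
    for u, v in tree_edges:
        adj[u].append(v)
        adj[v].append(u)
    # Root the tree at vertex 0 and record parents and depths by BFS.
    parent, depth, seen = [-1] * n, [0] * n, [False] * n
    seen[0] = True
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if not seen[w]:
                seen[w], parent[w], depth[w] = True, u, depth[u] + 1
                queue.append(w)
    # load[v] counts graph edges routed over the tree edge (v, parent[v]).
    load = [0] * n
    for u, v in graph_edges:
        while u != v:                 # climb from the deeper endpoint
            if depth[u] < depth[v]:
                u, v = v, u
            load[u] += 1
            u = parent[u]
    return max(load)
```

For the 4-cycle with the path 0-1-2-3 as spanning tree, every tree edge carries itself plus the detour of the edge (3, 0), so the congestion is 2.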

Threshold Rules for the Classical Prophet Inequality

2026-05-27T11:48:12Z

This note records a common threshold/surplus decomposition for single-threshold stopping rules in the classical prophet inequality. The same decomposition is used to certify several deterministic thresholds, including the median, half-mean, and balanced-surplus thresholds, and to give an averaged certificate for randomized thresholds distributed as the maximum.
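As a concrete instance, the sketch below evaluates the half-mean threshold $\tau = \mathbb{E}[\max_i X_i]/2$ exactly on a small discrete example by enumerating all outcomes. It assumes independent variables and the rule "accept the first value that is at least $\tau$"; the function name `threshold_rule` and the example distributions are made up for illustration.

```python
from itertools import product

def threshold_rule(dists, tau):
    """dists: one list of (value, probability) pairs per variable, in
    arrival order. Returns (E[max], E[reward of the threshold rule])."""
    e_max = e_alg = 0.0
    for outcome in product(*dists):
        p = 1.0
        for _, q in outcome:
            p *= q
        values = [v for v, _ in outcome]
        e_max += p * max(values)
        # Reward is the first value reaching the threshold, or 0 if none does.
        e_alg += p * next((v for v in values if v >= tau), 0.0)
    return e_max, e_alg

# X1 = 1 surely; X2 = 10 with probability 0.1 and 0 otherwise.
dists = [[(1.0, 1.0)], [(10.0, 0.1), (0.0, 0.9)]]
e_max, _ = threshold_rule(dists, float("inf"))   # infinite threshold: only E[max] is used
_, e_alg = threshold_rule(dists, e_max / 2)      # half-mean threshold
```

Here $\mathbb{E}[\max] = 1.9$ and the rule with $\tau = 0.95$ always stops at $X_1$, earning $1 \ge 0.95$, consistent with the factor-$1/2$ guarantee of the classical prophet inequality.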

High-Quality Multi-Constraint Hypergraph Partitioning via Greedy Rebalancing

2026-05-27T11:36:57Z

Multi-constraint hypergraph partitioning is a generalization of balanced partitioning, where the vertex set of a hypergraph is partitioned such that the inter-block connectivity of hyperedges is minimized while balancing the vertices with regard to $d$ distinct constraints. A prominent class of applications is data distribution tasks, where this makes it possible to achieve good load balance for $d$ different kinds of resources while simultaneously minimizing the communication volume. Although the best approaches for single-constraint partitioning are usually complex (multilevel) algorithms with many components, we show that replacing only one component already leads to high-quality multi-constraint partitions: the rebalancing step, which restores balance for a partition that has (hopefully) small connectivity but violates the constraints. We design a multi-constraint rebalancing algorithm based on greedy local search, proving that balance is always restored for $d=2$ and bounded maximum weight. The key is to ensure monotonically decreasing global imbalance by choosing an imbalance metric for which a balance-improving move is always available. Integrating our algorithm into the state-of-the-art partitioner Mt-KaHyPar, we demonstrate an 11.5\,\% geometric mean connectivity reduction compared to the next best competitor (Metis) and better reliability regarding partition balance, even though the majority of inputs lie outside the theoretical guarantee.

A Deterministic Separation Lemma

2026-05-27T08:21:57Z

The \emph{Separation Lemma} is a simple yet powerful tool, akin to the well-known \emph{Isolation Lemma}, that guarantees the uniqueness of certain set sums. Bandopadhyay et al.\ introduced this lemma to establish lower bounds for the \ALP problem with respect to certain structural parameters, relying on random weight assignments in the process. The lemma's applicability extends well beyond that specific work, especially in proving hardness results. However, while effective, these hardness results inherently rely on probabilistic assumptions. In this work, we give a fully \emph{deterministic} construction for the weight assignment required by the Separation Lemma. We provide formal proofs of correctness, explicit examples, and show how deterministic weights can replace randomized ones, thereby derandomizing existing hardness results for path-packing problems. Our exposition highlights a clear progression from the original randomized foundations to deterministic constructions and their practical implications.