Acceleration by Random Stepsizes: Hedging, Equalization, and the Arcsine Stepsize Schedule

2026-05-29T19:51:57Z

We show that for separable convex optimization, random stepsizes fully accelerate Gradient Descent. Specifically, using inverse stepsizes i.i.d. from the Arcsine distribution improves the convergence rate from $O(k)$ to $O(\sqrt{k})$, where $k$ is the condition number. No momentum or other algorithmic modifications are required. Our starting point is a remarkable "equalization property" of the Arcsine distribution: it yields an identical convergence rate for all quadratic functions. A key technical insight is that martingale arguments extend this phenomenon to all separable convex functions. We interpret this equalization as an extreme form of hedging: by using this random distribution over stepsizes, Gradient Descent converges at exactly the same rate for all functions in the function class.

High-Dimensional Expanders, the Sparsest Cut Problem, and Steurer's Conjecture

2026-05-29T19:24:12Z

In 2010, Steurer conjectured that any family of $n$ unit-norm vectors $v_1,\dots,v_n$ with polynomially small average correlation $\mathbb{E}_{i,j}|\langle v_i,v_j\rangle|\leq n^{-ε}$ contains linear-sized constant-separated sets. We refute this conjecture in a strong sense using the machinery of sparse high-dimensional expanders: such vector families do not even have linear-sized $\frac{1}{\log^{1/4-o(1)}(n)}$-separated sets. Consequently, we show that there are families of vertex expanders on $n$ vertices for which the (average) $L_2$-mixing time to the uniform distribution of any reweighted simple random walk is at least $\log^{5/4-o(1)} n$.

Inner Product Aware Quantization: Provably Fast, Accurate, and Adaptive Algorithms

2026-05-29T19:21:16Z

Quantization is a fundamental tool used to compress datasets, neural network weights, and memory usage in a range of computational tasks. Many downstream applications of vector quantization perform inner products with arbitrary inputs. This motivates the study of inner product aware quantization schemes that approximately preserve inner products with unseen vectors -- in contrast to simply minimizing the mean-squared error. In this work, we formulate objectives that capture natural desiderata and develop adaptive and unbiased quantization methods that approximately preserve inner products with worst-case and average-case inputs. An analysis of these objectives shows a tight connection with the well-studied notion of Adaptive Stochastic Quantization (ASQ). We develop provably fast exact and approximate algorithms for our objectives. Our theoretical results inspire efficient practical algorithms that perform well across a variety of workload distributions. They also lead to practical algorithms for standard ASQ which are 2-10$\times$ faster than prior state-of-the-art methods while maintaining quality. These theoretical and empirical results contribute towards making adaptive quantization techniques more efficient and tractable in practical settings.

An Upper Bound on Grothendieck's Constant

2026-05-29T18:28:09Z

We show that Grothendieck's real constant $K_G$ can be upper bounded by projecting vectors onto a random plane through the origin and thresholding a degree five Hermite polynomial. This resolves a conjecture of Braverman-Makarychev-Makarychev-Naor from 2011, who required an extra randomization step in their rounding scheme and proved $K_G<\fracπ{2\log(1+\sqrt{2})}-10^{-500}$. As a corollary of our result, we prove the bound $K_G<\fracπ{2\log(1+\sqrt{2})}-10^{-217}$ by thresholding degree three Hermite polynomials in the plane. We finally give a rigorous computer-assisted proof that $K_G<\fracπ{2\log(1+\sqrt{2})}-10^{-5}$ using interval arithmetic and degree three Hermite polynomial thresholding.

Neuro-symbolic Syntactic Parsing: Shaping a Neural Network with the CYK Algorithm

2026-05-29T15:21:11Z

In this paper, we show the possibility of a direct injection of algorithms into neural network architecture. We focus on a complex algorithm, that is, Cocke-Youger-Kasami (CYK) for parsing context-free grammars in Chomsky Normal Form and we propose CYKNN, a simple recurrent neural network architecture for encoding the CYK algorithm in trainable matrix-vector multiplications.We experimented with a very simple grammar with 4 variations showing that our approach outperforms existing LLMs with more than 20B parameters with an in-context learning setting and smaller LLMs of the Qwen family fine-tuned with LoRA. Our attempt paves the way to a different approach to neuro-symbolic methodologies.

An Optimal Algorithm for Binary Closest String

2026-05-29T15:18:49Z

We revisit the Binary Closest String problem, which asks, given a set of binary strings $X \subseteq \{0, 1\}^n$, to compute a string minimizing the maximum Hamming distance to $X$. A long line of work has focused on parameterized algorithms with respect to the optimal distance $d$, yielding a sequence of improvements from $O^*(d^d)$ through $O^*(16^d)$, $O^*(9.513^d)$, $O^*(8^d)$, $O^*(6.731^d)$ to the current best-known running time of $O^*(5^d)$ [Chen, Ma, Wang; Algorithmica '16]. We present a faster randomized algorithm running in time $O^*(4^d)$. Our result matches a recent fine-grained lower bound [Abboud, Fischer, Goldenberg, Karthik C.S., Safier; ESA '23], and is therefore conditionally optimal. As an extra benefit, our algorithm is remarkably simple.

A Distribution Testing Approach to Clustering Distributions

2026-05-29T13:21:46Z

We study the following distribution clustering problem: Given a hidden partition of $k$ distributions into two groups, such that the distributions within each group are the same, and the two distributions associated with the two clusters are $\varepsilon$-far in total variation, the goal is to recover the partition. We establish upper and lower bounds on the sample complexity for two fundamental cases: (1) when one of the cluster's distributions is known, and (2) when both are unknown. Our upper and lower bounds characterize the sample complexity's dependence on the domain size $n$, number of distributions $k$, size $r$ of one of the clusters, and distance $\varepsilon$. In particular, we achieve tightness with respect to $(n,k,r,\varepsilon)$ (up to an $O(\log k)$ factor) for all regimes.

Retriever Portfolios: A Principled Approach to Adaptive RAG

2026-05-29T11:43:05Z

Retrieval-augmented generation (RAG) systems typically rely on a single retriever and a single set of hyperparameters, despite facing highly heterogeneous queries that range from simple factoid questions to complex multi-hop reasoning. We propose a method that automatically selects a small, diverse subset of retrievers (a portfolio) from a large pool of candidates, to cover different regions of the target query distribution. We formalize this setting via an expected best-of-$k$ objective over the query distribution and show that it admits an efficient portfolio construction algorithm with near-optimal guarantees. Across multiple QA benchmarks, our learned portfolios and router pipeline consistently outperform single-retriever and naive multi-retriever baselines on both retrieval metrics and answer quality. In addition, compared to inference-time hyperparameter tuning approaches, fixed portfolios enable parallel retrieval and LLM calls, achieving comparable (and sometimes better) accuracy with substantially lower latency and token cost.

Improved Approximation Algorithms and Hardness Results for Shortest Common Superstring with Reverse Complements

2026-05-29T07:49:57Z

The Shortest Common Superstring (SCS) problem is a fundamental task in sequence analysis. In genome assembly, however, the double-stranded nature of DNA implies that each fragment may occur either in its original orientation or as its reverse complement. This motivates the Shortest Common Superstring with Reverse Complements (SCS-RC) problem, which asks for a shortest string that contains, for each input string, either the string itself or its reverse complement as a substring. The previously best-known approximation ratio for SCS-RC was $\frac{23}{8}$. In this paper, we present a new approximation algorithm achieving an improved ratio of $\frac{8}{3}$. Our approach computes an optimal constrained cycle cover by reducing the problem, via a novel gadget construction, to a maximum-weight perfect matching in a general graph. We also investigate the computational hardness of SCS-RC. While the decision version is known to be NP-complete, no explicit inapproximability results were previously established. We show that the hardness of SCS carries over to SCS-RC through a polynomial-time reduction, implying that it is NP-hard to approximate SCS-RC within a factor better than $\frac{333}{332}$. Notably, this hardness result holds even for the DNA alphabet.

Deterministic Monotone Min-Plus Product and Convolution

2026-05-29T04:49:30Z

The Monotone Min-Plus Product problem is a useful primitive that has seen many algorithmic applications over the past decade. In this problem, we are given two $n\times n$ integer matrices $A$ and $B$, where each row of $B$ is a monotone non-decreasing sequence of integers from $\{1,\dots,n\}$, and the goal is to compute their Min-Plus product, defined as the $n\times n$ matrix $C$ with $C_{i,j} = \min_{k}\{A_{i,k} + B_{k,j}\}$. The fastest known algorithm for this task [Chi, Duan, Xie, and Zhang, STOC'22] runs in $n^{(ω+3)/2+o(1)} = O(n^{2.686})$ time, significantly improving over the brute-force cubic algorithm. However, its main disadvantage is that it requires randomization, which is then inherited by all downstream applications. Our main result is a deterministic algorithm for Monotone Min-Plus product with the same time complexity $n^{(ω+3)/2+o(1)} = O(n^{2.686})$ as its randomized counterpart, improving upon the previous deterministic bound $O(n^{2.875})$ [Gu, Polak, Vassilevska Williams, and Xu, ICALP'21]. Our derandomization also applies to previously studied extensions and variants (e.g., [Dürr, IPL'23]), including rectangular matrices, bounded range $[n^μ]$, and column-monotone matrices. As an immediate consequence, we derandomize state-of-the-art algorithms for multiple problems, including Language Edit Distance, RNA Folding, Optimum Stack Generation, unweighted Tree Edit Distance, Batched Range Mode, and Approximate All-Pairs Shortest Paths. Our techniques also yield a deterministic algorithm for the Monotone Min-Plus Convolution problem that runs in $n^{1.5+o(1)}$ time, nearly matching the best known randomized time complexity $\widetilde{O}(n^{1.5})$ [Chi, Duan, Xie, and Zhang, STOC'22]. This algorithm can be used to derandomize state-of-the-art algorithms for Jumbled Indexing for binary strings and several variants of Knapsack.

Incremental BPE Tokenization

2026-05-29T04:04:32Z

We propose a novel algorithm for incremental Byte Pair Encoding (BPE) tokenization. The algorithm processes each input byte in worst-case $\mathcal{O}(\log^2 t)$ time, leading to an overall complexity of $\mathcal{O}(n \log^2 t)$, where $n$ is the input length and $t$ is the maximum token length. The algorithm incrementally maintains BPE tokenization results for every prefix of the input text, implementing the standard BPE merge procedure defined by a fixed set of merge rules. This enables efficient partial tokenization in streaming settings. Functioning as a drop-in replacement for standard BPE, our approach achieves a speedup of up to ${\sim}3\times$ over Hugging Face's tokenizers, and demonstrates significant latency reductions over OpenAI's tiktoken on pathological inputs. We further introduce an eager output algorithm that enables streaming output, emitting tokens as soon as token boundaries are determined during incremental tokenization. Overall, our results demonstrate that BPE tokenization can be performed incrementally with strong worst-case guarantees, while providing practical latency benefits in modern large language model pipelines. Code: https://github.com/ModelTC/mtc-inc-bpe

Optimal Rates for Differentially Private Hypothesis Testing with E-values

2026-05-29T03:39:02Z

E-values have attracted considerable interest in recent years as flexible tools for enabling anytime-valid and adaptive data analysis. Hypothesis testing is at the core of many of these applications, which can often involve private or sensitive data. In this work, we answer a simple but important question: given two distributions $\mathbb{P}$ and $\mathbb{Q}$, what is the maximum achievable e-power when testing $X\sim \mathbb{P}^n$ against $X\sim\mathbb{Q}^n$ with e-values that satisfy $\varepsilon$-differential privacy? We characterize the optimal rate for this problem and provide an algorithm which matches it exactly. In the sequential setting, when observations arrive one-by-one and the analyst chooses when to halt, we give matching upper and lower bounds on the stopping times of any private e-process. Numerical experiments confirm the practicality of our algorithms, which require less data than the recently proposed DP-SPRT across a range of sequential testing problems and privacy levels.

Listing Even Cycles Faster than the Submodular-Width Barrier

2026-05-28T20:53:42Z

A classic result of Alon, Yuster, and Zwick (AYZ, Algorithmica 1997) shows that all $2k$-cycles in an $m$-edge graph can be listed in $\tilde O(m^{2-1/k}+t)$ time, where $t$ is the output size. This bound underlies the {\em submodular width} of Marx (JACM 2013) and the PANDA framework of Abo Khamis, Ngo, and Suciu (PODS 2017), which extend AYZ to arbitrary conjunctive queries with degree constraints. A central open question is whether combinatorial algorithms can beat the submodular-width barrier. Bringmann and Gorbachev (STOC 2025) gave lower-bound evidence that submodular width may be optimal for general conjunctive queries under combinatorial algorithms. The picture changes for $2k$-cycles on undirected graphs, whose queries have self-joins and symmetric EDBs: recent works improve on AYZ for even-cycle detection and listing. Pinning down the complexity of $C_{2k}$-detection and listing is thus a natural step toward overcoming the submodular-width barrier for such queries. For detection, Dahlgaard, Knudsen, and St{ö}ckel (STOC 2017) solved $C_{2k}$-detection in $\tilde O(m^{2k/(k+1)})$ time. Listing is harder. Jin and Xu (STOC 2023), and independently Abboud, Khoury, Leibowitz, and Safier (FSTTCS 2023), listed 4-cycles in $\tilde O(m^{4/3}+t)$ time; Vassilevska~Williams and Westover (ITCS 2025) listed 6-cycles in $\tilde O(m^{8/5}+t)$ time, improving the AYZ bounds of $\tilde O(m^{3/2})$ and $\tilde O(m^{5/3})$. The general case has remained open for 30 years. Building on these works, we list $2k$-cycles in $\tilde O(m^{(2k^2-k+1)/(k^2+1)}+t)$ time, improving AYZ for every $k\geq 3$. The key ingredient is an \emph{asymmetric supersaturation} result for even cycles. Our algorithms use only join and project operators over multiple tree-decomposition plans, making them naturally implementable in database systems, in contrast to prior BFS-based graph approaches.

Energy-Efficient Aggregation and Minimum-Degree Spanning Trees in Radio Networks

2026-05-28T20:28:08Z

We study the aggregation problem in synchronous multi-hop radio networks with $O(\log n)$-bit messages and no collision detection. Each node initially holds a value, and the goal is to compute a global aggregate such as the sum of all values. Aggregation tasks arise naturally in wireless sensor networks, where nodes are often battery-powered and radio activity is the dominant source of energy consumption. Accordingly, our main objective is to minimize the energy complexity, defined as the maximum number of rounds in which any node is awake. Our main result is a randomized distributed algorithm that, with high probability, constructs and executes an aggregation schedule in $O(n \operatorname{polylog} n)$ rounds and using $O(Δ^\ast \operatorname{polylog} n)$ energy, where $Δ^\ast$ is the minimum possible maximum degree of a spanning tree of the network graph. This guarantee is nearly optimal: for any aggregation schedule and any graph, there exists a node that must be awake for at least $Δ^\ast$ rounds. As a by-product, the algorithm also computes a spanning tree whose maximum degree is within an $O(\log n)$ factor of $Δ^\ast$, with the same round and energy guarantees. For every tree edge, both endpoints learn that the edge belongs to the tree.

Privately Estimating Monotone Statistics in Polynomial Time

2026-05-28T20:19:03Z

We study efficient differentially private algorithms for estimating monotone statistics, i.e., statistics that are monotone under the addition of new observations. The starting point for our investigation is subsample-and-aggregate: a classical paradigm that partitions the dataset into blocks, estimates the statistic on each block, and then privately aggregates the estimates. While practical and generically applicable, this approach is quite data-hungry. We improve upon this framework for the class of monotone statistics -- compared to subsample-and-aggregate, our algorithms save a factor of $t$ in sample complexity and pay a factor of $e^t$ in running time, where $t>0$ is a tunable parameter. We complement our results with a query-complexity lower bound, showing that our algorithms are essentially optimal for this task. As an application, we obtain improved results for private eigenvalue estimation, private loss estimation, and privately estimating a single parameter of a high-dimensional model, e.g., in linear regression.