CAFS: A Cache-Aware Frequency Sort for Low-Cardinality Integer Data on x86-64

2026-05-24T12:37:36Z

Integer sorts in OLAP engines often run on columns whose cardinality $K$ is much smaller than the array length $N$. After a group-by stage the intermediate key column has $K$ bounded by the number of distinct group keys, and even a column-store scan typically operates on dictionary-encoded categorical fields where $K$ never exceeds a few thousand. A comparison sort on such a column still pays $Θ(N \log N)$ comparisons, and a radix sort still pays $Θ(N \cdot B/b)$ byte passes, irrespective of $K$. This paper describes CAFS, an integer sort that does exploit it on x86-64 with AVX2. The algorithm combines a SIMD bucket sized to one cache line, a Chao1 cardinality estimator over 1024 strided samples (kept in a heap-allocated 40 KB open-addressing table), and an adaptive dispatcher backed by a spill safety guard. The hot loop is branchless and uses AVX2 cmpeq together with movemask and tzcnt to locate the matching lane. We benchmarked CAFS on a full-factorial grid of 58 array sizes $N$ from $10^3$ to $3 \cdot 10^7$ with dense $K$ schedules per $N$, producing 592770 timed runs against pdqsort, IPS4o, vqsort, ska_sort, and std::sort. In the $K \ll N$ band the throughput is 1.7 to 3.1x that of pdqsort, 1.7 to 3.5x IPS4o, and 1.2 to 2.3x vqsort. The operational crossover against pdqsort is at $K \approx 1.3 \cdot 10^5$; against ska_sort, $K \approx 8.14 \cdot 10^5$; against vqsort, $K \approx 6.7 \cdot 10^5$; and against IPS4o the curves only converge near $K = N$. Of the five baselines, only vqsort actually overtakes CAFS once the crossover is passed, which makes the vqsort threshold at $K \approx 6.7 \cdot 10^5$ the binding constraint on the operational range of CAFS.
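
The abstract names two ingredients that are easy to illustrate in scalar form: a Chao1 cardinality estimate over strided samples, and a sort that counts distinct keys instead of comparing elements. The Python sketch below is only an illustration under stated assumptions, not the authors' AVX2 implementation. The function names, the bias-corrected Chao1 variant, and the dispatch threshold (borrowed from the reported pdqsort crossover) are all hypothetical choices made for this example.

```python
from collections import Counter


def chao1_estimate(sample):
    """Bias-corrected Chao1 estimate of the number of distinct values."""
    counts = Counter(sample)
    freq_of_freq = Counter(counts.values())
    s_obs = len(counts)      # distinct values actually observed
    f1 = freq_of_freq[1]     # values observed exactly once
    f2 = freq_of_freq[2]     # values observed exactly twice
    # This variant stays defined when f2 == 0.
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


def strided_sample(data, n_samples=1024):
    """Take up to n_samples evenly strided elements (1024 follows the abstract)."""
    stride = max(1, len(data) // n_samples)
    return data[::stride][:n_samples]


def frequency_sort(data):
    """Count each distinct key, then emit keys in order: O(N + K log K)."""
    counts = Counter(data)
    out = []
    for key in sorted(counts):
        out.extend([key] * counts[key])
    return out


def adaptive_sort(data, k_threshold=130_000):
    """Hypothetical dispatcher: frequency sort only when the estimated K is small."""
    if chao1_estimate(strided_sample(data)) <= k_threshold:
        return frequency_sort(data)
    return sorted(data)  # fallback comparison sort
```

For low-cardinality input the work is dominated by the counting pass, which is where the independence from $\log N$ comes from; the fallback path stands in for the spill safety guard that the abstract mentions.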

Robust Permutation Flowshops Under Budgeted Uncertainty

2026-05-24T08:57:52Z

We consider the robust permutation flowshop problem under the budgeted uncertainty model, where at most a given number of job processing times may deviate on each machine. We show that solutions for this problem can be determined by solving polynomially many instances of the corresponding nominal problem. As a direct consequence, our result implies that this robust flowshop problem can be solved in polynomial time for two machines, and can be approximated in polynomial time for any fixed number of machines. The reduction that is our main result follows from an analysis similar to Bertsimas and Sim (2003) except that dualization is applied to the terms of a min-max objective rather than to a linear objective function. Our result may be surprising considering that heuristic and exact integer programming based methods have been developed in the literature for solving the two-machine flowshop problem. Next, we show a logarithmic factor improvement in the overall running time implied by a naive reduction to nominal problems in the case of two machines and three machines. We conclude by noting that our reduction appears to have more general consequences for robust optimization problems under budgeted uncertainty having a similar form.
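
For context on what a nominal two-machine instance looks like, the sketch below assumes the objective is the makespan and uses Johnson's rule, the classical polynomial-time method for that nominal problem. It does not implement the reduction from the abstract; the job data and function names are made up for illustration.

```python
def makespan(order, jobs):
    """Completion time on machine 2 when jobs run in `order` on both machines."""
    end1 = end2 = 0
    for j in order:
        p1, p2 = jobs[j]
        end1 += p1                   # machine 1 never idles
        end2 = max(end2, end1) + p2  # machine 2 may wait for machine 1
    return end2


def johnson_order(jobs):
    """Johnson's rule: an optimal permutation for the nominal two-machine flowshop."""
    # Jobs that are no slower on machine 1 go first, by increasing machine-1 time.
    first = sorted((j for j, (p1, p2) in enumerate(jobs) if p1 <= p2),
                   key=lambda j: jobs[j][0])
    # The remaining jobs go last, by decreasing machine-2 time.
    last = sorted((j for j, (p1, p2) in enumerate(jobs) if p1 > p2),
                  key=lambda j: jobs[j][1], reverse=True)
    return first + last


jobs = [(3, 2), (1, 4), (5, 3), (2, 6)]  # (time on machine 1, time on machine 2)
order = johnson_order(jobs)
print(order, makespan(order, jobs))  # prints: [1, 3, 2, 0] 16
```

A robust variant would call such a nominal solver repeatedly on modified processing times; how those instances are constructed is the content of the paper and is not reproduced here.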

Approximation algorithms for the prize-collecting rural postman problem

2026-05-24T08:45:58Z

In this paper, we study the prize-collecting rural postman problem (PCRPP), a variant of the rural postman problem. Given a PCRPP instance consisting of an undirected graph whose edges have nonnegative lengths and nonnegative profits, together with a root vertex, the goal is to find a closed walk that starts and ends at the root vertex and minimizes the sum of its length and the profits of all edges that the walk does not traverse. A natural way to design an approximation algorithm for the PCRPP is to construct a prize-collecting traveling salesman problem (PCTSP) instance from the given PCRPP instance, apply an approximation algorithm to the PCTSP instance, and then convert the resulting PCTSP solution into a PCRPP solution. We show that this approach has an inherent factor-two barrier: even if the constructed PCTSP instance is solved exactly, the resulting PCRPP solution can have objective value arbitrarily close to twice the optimum value of the original PCRPP instance. Our main result is a polynomial-time approximation algorithm with approximation ratio strictly smaller than 1.6 for the PCRPP. On a public 118-instance benchmark set, the proposed algorithm has average and maximum optimality gaps of 3.39\% and 12.12\%, respectively.

Lattice Structure and Efficient Basis Construction for Strongly Connected Orientations

2026-05-24T07:39:16Z

Let $\vec{G}=(V,E^+\cup E^-)$ be a bidirected graph whose underlying undirected graph $G=(V,E)$ is $2$-edge-connected. A strongly connected orientation (SCO) is defined as a subset of arcs that contains exactly one of $e^+,e^-$ for every $e\in E$ and induces a strongly connected subgraph of $\vec{G}$. Given a family $\mathcal{F}$ of proper subsets of $V$, we call an SCO tight if there is exactly one arc entering $U$ for every $U\in \mathcal{F}$. We give a polynomial-time algorithm to construct a set $\mathcal{B}$ consisting of tight SCO's which forms an integral basis for the linear hull of tight SCO's. That is, $\mathcal{B}$ is a linearly independent subset of tight SCO's, and every integer vector in the linear hull of tight SCO's can be written as an integral combination of $\mathcal{B}$. This extends the main result of Abdi, Cornuéjols, Liu and Silina (IPCO 2025), who gave a non-constructive proof of the existence of such a basis in an equivalent setting. While the previous proof uses polyhedral theory, our proof is purely combinatorial and yields a polynomial-time algorithm. As an application of our algorithm, we show that parity-constrained tight strongly connected orientations can be solved in deterministic polynomial time. Along the way, we discover appealing connections to the theory of perfect matching lattices.

New and Improved Bounds for Markov Paging

2026-05-24T05:02:19Z

In the Markov paging model, one assumes that page requests are drawn from a Markov chain over the pages in memory, and the goal is to maintain a fast cache that suffers few page faults in expectation. While computing the optimal online algorithm $(\mathrm{OPT})$ for this problem naively takes time exponential in the size of the cache, the best-known polynomial-time approximation algorithm is the dominating distribution algorithm due to Lund, Phillips and Reingold (FOCS 1994), who showed that the algorithm is $4$-competitive against $\mathrm{OPT}$. We substantially improve their analysis and show that the dominating distribution algorithm is in fact $2$-competitive against $\mathrm{OPT}$. We also show a lower bound of $1.5907$-competitiveness for this algorithm -- to the best of our knowledge, no such lower bound was previously known.

Improved Dual Attack and Trapdoor Sampling via Quantum Rejection Sampling

2026-05-24T01:01:43Z

In this work, we revisit the dual attack and GPV trapdoor sampling, focusing on the lattice Gaussian sampling term, which can be a significant bottleneck in the overall complexity. We show that this sampling step can be quantumly accelerated by combining the lower bound underlying Wang and Ling's analysis of Klein's algorithm with the quantum rejection sampling (QRS) framework proposed by Ozols et al. Specifically, this lower bound gives precisely the pointwise domination condition required for quantum rejection sampling when given coherent oracle access to a truncated Klein proposal distribution, which yields a quantum procedure for preparing the truncated dual $q$-ary lattice Gaussian with a quadratic reduction in the sampling complexity. The truncation radius is chosen so that the truncated distribution is negligibly close to the full lattice Gaussian in total variation distance. Substituting this sampler into the dual attack framework results in reduced overall attack-cost estimates. Compared with Pouly and Shen's modern dual attack under the same parameter choices, our estimates reduce the attack cost by $9$, $4$, and $13$ bits for Kyber-512, Kyber-768, and Kyber-1024, respectively. We also report the corresponding estimates with modulus switching. Finally, by replacing the Markov chain Monte Carlo (MCMC) sampler with the QRS algorithm, we achieve a similar quadratic speedup in the GPV signing process.

A computational phase transition for learning-to-sample from Ising models

2026-05-23T22:04:30Z

We study \emph{learning-to-sample} -- a basic algorithmic task underlying generative modeling -- for Ising models, a standard testbed for algorithmic ideas in both theoretical computer science and machine learning. Given i.i.d. samples of an unknown target distribution, the goal of learning-to-sample is to learn a computationally efficient generation procedure that produces new samples following approximately the same distribution. We construct a family of Ising models of constantly bounded-width which lie just beyond the spectral threshold $λ_{\max}(J)-λ_{\min}(J)=1$, and show that learning-to-sample for this family is computationally hard under standard cryptographic assumptions, even when the learner is given both polynomially many i.i.d. samples from the model and explicit access to its parameters. Combined with results of [AJKPV24,KLV25] showing tractability of learning-to-sample below the spectral threshold, this establishes a sharp computational phase transition at the spectral threshold. Moreover, combined with prior results on parameter learning for bounded-width Ising models [KM17,WSD19,VML20], this shows that learning-to-sample can be more difficult than parameter learning. Finally, we show that any efficient learner for these hard instances exhibits a natural memorization-hallucination dichotomy: the learner must either output configurations that, after a simple transformation, match the (transformed) training data or place substantial mass on configurations of negligible probability under the target distribution.

Covering vertices by sequential stars

2026-05-23T19:31:47Z

We study the problem of covering the maximum number of vertices in a graph by a collection of vertex-disjoint stars, each with a number of satellites in a given interval $[k, \ell]$, where $1 \le k < \ell$ and $\ell$ can be infinity. This is referred to as the sequential {\sc $[k, \ell]$-Star Packing} problem. It is solvable in polynomial time when $k = 1$, but becomes strongly NP-hard when $k \ge 2$. In this paper, we propose either the first or an improved approximation algorithm for the following four sequential settings: 1) a $\frac {k+1}2$-approximation algorithm when $k \ge 3$ and $\ell = \infty$, improving the previous best ratio of $\frac {(k+1)^2}{2k+1}$; 2) a $\frac 43$-approximation algorithm when $k = 2$ and $\ell = \infty$, improving the previous best ratio of $\frac 32$; 3) the first $(1 + \frac \ell{\ell+1})$-approximation algorithm when $2 = k < \ell$; and 4) the first $(1 + \max\left\{\frac {k-1}2, \frac {(k+1) \ell}{3 (\ell+1)}\right\})$-approximation algorithm when $3 \le k < \ell$. The main algorithmic techniques are local search coupled with amortized analysis; in addition, we identify augmenting configurations that bridge two distant neighborhoods in a local improvement operation. Additionally, the problem has been shown to be APX-hard when $k \ge 3$; we prove its APX-hardness for the last remaining case where $k = 2$.

Optimal Smoothed Analysis of the Simplex Method

2026-05-23T12:59:46Z

Smoothed analysis is a method for analyzing the performance of algorithms, used especially for those algorithms whose running time in practice is significantly better than what can be proven through worst-case analysis. Spielman and Teng (STOC '01) introduced the smoothed analysis framework of algorithm analysis and applied it to the simplex method. Given an arbitrary linear program with $d$ variables and $n$ inequality constraints, Spielman and Teng proved that the simplex method runs in time $O(σ^{-30} d^{55} n^{86})$, where $σ> 0$ is the standard deviation of Gaussian distributed noise added to the original LP data. Spielman and Teng's result was simplified and strengthened over a series of works, with the current strongest upper bound being $O(σ^{-3/2} d^{13/4} \log(n)^{7/4})$ pivot steps due to Huiberts, Lee and Zhang (STOC '23). We prove that there exists a simplex method whose smoothed complexity is upper bounded by $O(σ^{-1/2} d^{11/4} \log(n)^{7/4})$ pivot steps. Furthermore, we prove a matching high-probability lower bound of $Ω( σ^{-1/2} d^{1/2}\ln(4/σ)^{-1/4})$ on the combinatorial diameter of the feasible polyhedron after smoothing, on instances using $n = \lfloor (4/σ)^d \rfloor$ inequality constraints. This lower bound indicates that our algorithm has optimal noise dependence among all simplex methods, up to polylogarithmic factors.

Fermi-Dirac machines as quantizations of neurons

2026-05-23T04:09:03Z

Fermi-Dirac machines were proposed recently as an approach to solving semidefinite optimization problems on quantum computers. Here, we reinterpret them as canonical quantizations of classical neurons. By viewing a classical neuron as an activation function applied to a parameterized classical Hamiltonian, we quantize this model by replacing classical variables with operators whose eigenvalues encode their possible values. This follows the standard approach to canonical quantization in quantum mechanics. Crucially, when the Hamiltonian consists of commuting operators, our construction reduces exactly to a classical neuron. More generally, our approach yields an activation observable, defined as an activation function applied to a parameterized quantum Hamiltonian. The output of this quantized neuron is a random variable with expectation value equal to that of the activation observable with respect to an input state. We develop efficient hybrid quantum-classical algorithms for evaluating outputs and gradients of our quantized neurons, enabling evaluation and training. These algorithms rely on basic primitives that include random sampling, Hamiltonian simulation, and the Hadamard test. We also quantize a whole host of other activation functions, including the smooth rectified linear unit (ReLU), sigmoid linear unit, Gaussian-smoothed ReLU, and Gaussian error linear unit (GeLU), which are known to be useful for deep learning applications. Numerical experiments indicate that neurons based on quantum Hamiltonians can learn functions that classical neurons cannot. We further define a computational decision problem based on Fermi-Dirac neurons and prove that it is BQP-complete, providing complexity-theoretic evidence against efficient classical simulation. Finally, we generalize our approach to continuous quantum variables and sketch two different ways of composing these neurons into networks.

A Comprehensive Evaluation of Vertex Elimination Algorithms for Algorithmic Differentiation

2026-05-22T23:15:37Z

The algorithmic differentiation (AD) of mathematical functions can be interpreted as a sequence of vertex eliminations in an underlying directed acyclic graph. The problem of determining a minimum-cost elimination ordering, which we call Optimal Vertex Elimination, is NP-complete. Consequently, much effort has been devoted to the design of heuristics. Many of these heuristics are widely believed to perform well in practice, but this hypothesis has so far been difficult to test due to the lack of scalable exact methods. We design and engineer new integer programming formulations for Optimal Vertex Elimination and for a related objective we call Minimum Edge Count. Our implementations scale to graphs one to two orders of magnitude larger than existing techniques, enabling the assembly of a corpus of medium-sized graphs for which optimal solutions are known. This corpus facilitates a study of existing heuristics, confirming that on real data popular methods achieve high-quality solutions. We also make several theoretical contributions. We give a tight analysis of the forward and reverse modes of AD, and extend our techniques to provide a simple algorithm for Optimal Vertex Elimination with approximation ratio parameterized by the size of a minimum source-sink separator. On the complexity side, we give the first approximation lower bounds for both problems.

Towards Universal Convergence of Backward Error in Linear System Solvers

2026-05-22T19:35:55Z

The quest for an algorithm that solves an $n\times n$ linear system in $O(n^2)$ time complexity, or $O(n^2 \text{poly}(1/ε))$ when solving up to $ε$ relative error, is a long-standing open problem in numerical linear algebra and theoretical computer science. There are two predominant paradigms for measuring relative error: forward error (i.e., distance from the output to the optimum solution) and backward error (i.e., distance to the nearest problem solved by the output). In most prior studies, convergence of iterative linear system solvers is measured via various notions of forward error, and as a result, depends heavily on the conditioning of the input. Yet, the numerical analysis literature has long advocated for backward error as the more practically relevant notion of approximation. In this work, we show that -- surprisingly -- the classical and simple Richardson iteration incurs at most $1/k$ (relative) backward error after $k$ iterations on any positive semidefinite (PSD) linear system, irrespective of its condition number. This universal convergence rate implies an $O(n^2/ε)$ complexity algorithm for solving a PSD linear system to $ε$ backward error, and we establish similar or better complexity when using a variety of Krylov solvers beyond Richardson. Then, by directly minimizing backward error over a Krylov subspace, we attain an even faster $O(1/k^2)$ universal rate, and we turn this into an efficient algorithm, MINBERR, with complexity $O(n^2/\sqrt{ε})$. Finally, we extend this approach via normal equations to solving general linear systems in $O(n^2\log(n)/ε)$ time complexity. We report strong numerical performance of our algorithms on benchmark problems.
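
To make the backward-error viewpoint concrete, the sketch below runs Richardson iteration on a small, badly conditioned PSD system and records a backward error at every iterate. The abstract does not spell out its exact definition, so the sketch assumes the common normwise form $\|b - Ax\| / (\|A\|\,\|x\| + \|b\|)$ with step size $1/λ_{\max}(A)$; the test matrix and all function names are made up for illustration.

```python
import math


def matvec(A, x):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def norm2(v):
    """Euclidean norm."""
    return math.sqrt(sum(vi * vi for vi in v))


def richardson_backward_errors(A, b, lam_max, iters):
    """Run Richardson iteration from x = 0, recording a backward error per iterate.

    Assumption: backward error is ||b - A x|| / (||A|| ||x|| + ||b||), where the
    spectral norm ||A|| equals lam_max because A is PSD.
    """
    x = [0.0] * len(b)
    errors = []
    for k in range(iters + 1):
        r = [bi - yi for bi, yi in zip(b, matvec(A, x))]
        errors.append(norm2(r) / (lam_max * norm2(x) + norm2(b)))
        if k < iters:
            # Richardson step with step size 1 / lam_max.
            x = [xi + ri / lam_max for xi, ri in zip(x, r)]
    return x, errors


A = [[1.0, 0.0], [0.0, 1e-4]]  # condition number 10^4
b = [1.0, 1.0]
x, errors = richardson_backward_errors(A, b, lam_max=1.0, iters=50)
for k in (1, 10, 50):
    print(k, errors[k])
```

On this instance the residual barely moves in the poorly conditioned direction, yet the backward error still shrinks because $\|x\|$ grows; that is the effect the abstract describes, shown here only for one hand-picked example.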

A Tight Bound on Localization of Electrical Flows

2026-05-22T18:48:17Z

We prove that for any unweighted graph on n vertices the L1 norm of a unit electric current between the endpoints of a random edge is at most 2 log n. Furthermore, we show that on any weighted graph the spectral norm of the entry-wise absolute value of the symmetric transfer-current matrix is at most 2 log n. This bound is tight up to constants and improves the O(log^2 n) bound from [Schild-Rao-Srivastava, SODA '18]. The initial proofs were generated by OpenAI's ChatGPT 5.5 Pro; the authors have verified and rewritten them to enhance readability and provide additional context.

Linear Regression with Unknown Truncation Beyond Gaussian Features

2026-05-22T17:08:02Z

In truncated linear regression, samples $(x,y)$ are shown only when the outcome $y$ falls inside a certain survival set $S^\star$ and the goal is to estimate the unknown $d$-dimensional regressor $w^\star$. This problem has a long history of study in Statistics and Machine Learning going back to the works of (Galton, 1897; Tobin, 1958) and more recently in, e.g., (Daskalakis et al., 2019; 2021; Lee et al., 2023; 2024). Despite this long history, however, most prior works are limited to the special case where $S^\star$ is precisely known. The more practically relevant case, where $S^\star$ is unknown and must be learned from data, remains open: indeed, here the only available algorithms require strong assumptions on the distribution of the feature vectors (e.g., Gaussianity) and, even then, have a $d^{\mathrm{poly} (1/\varepsilon)}$ run time for achieving $\varepsilon$ accuracy. In this work, we give the first algorithm for truncated linear regression with unknown survival set that runs in $\mathrm{poly} (d/\varepsilon)$ time, by only requiring that the feature vectors are sub-Gaussian. Our algorithm relies on a novel subroutine for efficiently learning unions of a bounded number of intervals using access to positive examples (without any negative examples) under a certain smoothness condition. This learning guarantee adds to the line of works on positive-only PAC learning and may be of independent interest.

Optimal Dimension-Free Sampling for Regularized Classification

2026-05-22T15:05:33Z

We prove optimal sampling bounds achieving $(1\pm\varepsilon)$-relative error for a broad class of Lipschitz continuous classification loss functions under various regularization terms. This includes important functions such as logistic and sigmoid loss, hinge loss, and ReLU loss. In particular, we prove $k^2/\varepsilon^2$ upper and lower bounds for $\|\cdot\|_2/k$ regularization, and $k/\varepsilon^2$ upper and lower bounds for $\|\cdot\|_1/k$ regularization. For $\|\cdot\|_2^2/k$ regularization, the sampling complexity depends mainly on a bounded derivative property: if $|g'(x)|\leq g(x)$, and $g(0)>0$, and $g$ is monotonic or convex, then it admits sampling complexity linear in $k$; otherwise the general bound is $k^2/\varepsilon^2$. However, if $g(0)=0$, our results indicate that no dimension-free bounds are possible, and even sublinear bounds are ruled out. All upper bounds are complemented by matching lower bounds up to polylogarithmic terms. Moreover, our work relies conceptually and algorithmically on simple uniform or (squared) norm sampling and thereby improves over recent cubic $k^3/\varepsilon^2$ sensitivity sampling bounds of (Alishahi and Phillips, ICML'24). This is achieved by refined arguments involving higher moment bounds and empirical process analyses to avoid the overcounting that appears in the de facto standard VC-dimension and sensitivity framework.