Optimal Phylogenetic Reconstruction from Sampled Quartets

April 19, 2026 ยท Grace Period ยท ๐Ÿ› STOC 2026

โณ Grace Period
This paper is less than 90 days old. We give authors time to release their code before passing judgment.
Authors Dionysis Arvanitakis, Vaggos Chatziafratis, Yiyuan Luo, Konstantin Makarychev arXiv ID 2604.17461 Category cs.DS: Data Structures & Algorithms Citations 0 Venue STOC 2026
Abstract
Quartet Reconstruction, the task of recovering a phylogenetic tree from smaller trees on four species called \textit{quartets}, is a well-studied problem in theoretical computer science with far-reaching connections to statistics, graph theory and biology. Given a random sample containing $m$ noisy quartets, labeled by an unknown ground-truth tree $T$ on $n$ taxa, we want to output a tree $\widehat T$ that is \textit{close} to $T$ in terms of quartet distance and can predict unseen quartets. Unfortunately, the empirical risk minimizer corresponds to the $\mathsf{NP}$-hard problem of finding a tree that maximizes agreements with the sampled quartets, and earlier works in approximation algorithms gave $(1-\eps)$-approximation schemes (PTAS) for dense instances with $m=ฮ˜(n^4)$ quartets, or for $m=ฮ˜(n^2\log n)$ quartets \textit{randomly} sampled from $T$. Prior to our work, it was unknown how many samples are information-theoretically required to learn the tree, and whether there is an efficient reconstruction algorithm. We present optimal results for reconstructing an unknown phylogenetic tree $T$ from a random sample of $m=ฮ˜(n)$ quartets, corrupted under the Random Classification Noise (RCN) model. This matches the $ฮฉ(n)$ lower bound required for any meaningful tree reconstruction. Our contribution is twofold: first, we give a tree reconstruction algorithm that, not only achieves a $(1-\eps)$-approximation, but most importantly \textit{recovers} a tree close to $T$ in quartet distance; second, we show a new $ฮ˜(n)$ bound on the Natarajan dimension of phylogenies (an analog of VC dimension in multiclass classification). Our analysis relies on a new \textit{Quartet-based Embedding and Detection} procedure that identifies and removes well-clustered subtrees from the (unknown) ground-truth $T$ via semidefinite programming.
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

๐Ÿ“œ Similar Papers

In the same crypt โ€” Data Structures & Algorithms