Scaling hierarchical agglomerative clustering to trillion-edge graphs

In a line of work beginning in 2021, we carefully investigated the complexity and parallel scalability of HAC from a sparse graph-theoretic perspective. We observed that in practice, only m << n² of the most similar entries in the similarity matrix are actually relevant for computing high-quality clusters (a sketch of this sparse setting follows the list below). Focusing on this sparse setting, we set out to determine whether we could design HAC algorithms that exploit sparsity to achieve two key goals:

  1. a number of computation steps proportional to the number of similarities (m), rather than n², and
  2. high parallelism and scalability.
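
To make the sparse setting concrete, here is a minimal Python sketch, under illustrative assumptions of our own (the use of a k-nearest-neighbor graph, the value of k, and the distance-to-similarity transform are not prescribed by the work described here), of how a sparse weighted similarity graph with m << n² edges might be built from a point set:

```python
# A minimal sketch of the sparse setting: instead of materializing the dense
# n x n similarity matrix, keep only m << n^2 entries by connecting each point
# to its k nearest neighbors and weighting the edges by similarity.
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
points = rng.normal(size=(10_000, 64))   # n = 10,000 points in 64 dimensions

k = 10  # illustrative choice of neighborhood size
distances = kneighbors_graph(points, n_neighbors=k, mode="distance")

# Turn distances into similarities (one of many possible transforms); the
# result is a sparse weighted graph with m = n * k edges instead of n^2 entries.
similarities = distances.copy()
similarities.data = 1.0 / (1.0 + similarities.data)
print(f"n = {points.shape[0]}, m = {similarities.nnz}, n^2 = {points.shape[0] ** 2}")
```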

We started by carefully exploring the first goal, since achieving it would mean that HAC can be solved much faster on sparse graphs, thus presenting a path to scaling to extremely large datasets. We showed that this is possible by presenting an efficient algorithm for the problem on sparse graphs that runs in sub-quadratic work (specifically, O(nm½) work, up to poly-logarithmic factors). We obtained the algorithm by designing a careful accounting scheme that combines the classic nearest-neighbor chain algorithm for HAC with a dynamic edge-orientation algorithm.
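The dynamic edge-orientation component is involved, but the nearest-neighbor chain idea the algorithm builds on is easy to sketch. Below is a minimal Python sketch of the classic nearest-neighbor chain procedure for average-linkage HAC; it is not the sub-quadratic algorithm itself, and the dense-matrix representation, tie handling, and function name are our illustrative choices:

```python
# A minimal sketch of the classic nearest-neighbor chain algorithm: follow
# most-similar-neighbor pointers until two clusters are mutually most similar,
# then merge them. Average linkage is "reducible", which makes this valid.
import numpy as np

def nn_chain_average_linkage(sim):
    """sim: dense symmetric (n, n) similarity matrix. Returns the merge sequence."""
    n = sim.shape[0]
    s = sim.astype(float).copy()
    np.fill_diagonal(s, -np.inf)
    size = np.ones(n)                  # cluster sizes, for the average-linkage update
    active = set(range(n))
    chain, merges = [], []

    while len(active) > 1:
        if not chain:
            chain.append(next(iter(active)))
        a = chain[-1]
        # Most similar active neighbor of the cluster on top of the chain.
        b = max((c for c in active if c != a), key=lambda c: s[a, c])
        if len(chain) > 1 and chain[-2] == b:
            # a and b are mutually most similar: merge them.
            chain.pop(); chain.pop()
            merges.append((a, b, s[a, b]))
            # Average-linkage (Lance-Williams) update; the merged cluster keeps index a.
            for c in active - {a, b}:
                s[a, c] = s[c, a] = (size[a] * s[a, c] + size[b] * s[b, c]) / (size[a] + size[b])
            size[a] += size[b]
            active.remove(b)
        else:
            chain.append(b)
    return merges
```

The chain keeps the total number of nearest-neighbor searches linear in n, but in this dense form each search scans a full row, giving quadratic total work; the accounting scheme and edge-orientation structure mentioned above are what bring this down to roughly O(nm½) in the sparse setting.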

However, the sub-quadratic work algorithm requires maintaining a sophisticated dynamic edge-orientation data structure and is not easy to implement. We complemented our exact algorithm with a simple algorithm for approximate average-linkage HAC, which runs in nearly-linear work, i.e., O(m + n) up to poly-logarithmic factors, and is a natural relaxation of the greedy algorithm. Let ε be an accuracy parameter. Then, more formally, a (1+ε)-approximation of HAC is a sequence of cluster merges, where the similarity of each merge is within a factor of (1+ε) of the highest-similarity edge in the graph at that moment (i.e., the merge that the exact greedy algorithm would perform).
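This definition is easy to state operationally. The following minimal Python sketch (our own illustration; the graph representation, the convention that a merged cluster keeps the first cluster's id, and the function name are assumptions) replays a proposed merge sequence on a sparse weighted similarity graph and checks whether every merge is within a (1+ε) factor of the best average-linkage similarity available at that moment:

```python
def is_one_plus_eps_approx(n, edges, merge_sequence, eps):
    """edges: list of (u, v, w) similarities over n points; merge_sequence:
    list of (a, b) pairs of cluster representatives, assumed adjacent in the
    current graph. The merged cluster keeps id a (an illustrative convention)."""
    size = {v: 1 for v in range(n)}
    adj = {v: {} for v in range(n)}    # adj[a][b] = total edge weight between clusters a, b
    for u, v, w in edges:
        adj[u][v] = adj[u].get(v, 0.0) + w
        adj[v][u] = adj[v].get(u, 0.0) + w

    def linkage(a, b):
        # Average-linkage similarity between clusters a and b.
        return adj[a][b] / (size[a] * size[b])

    for a, b in merge_sequence:
        best = max(linkage(x, y) for x in adj for y in adj[x])
        if linkage(a, b) * (1 + eps) < best:
            return False               # this merge falls outside the (1+eps) window
        # Merge cluster b into cluster a: combine weight sums and sizes.
        for c, w in adj[b].items():
            if c != a:
                adj[a][c] = adj[a].get(c, 0.0) + w
                adj[c][a] = adj[c].get(a, 0.0) + w
                del adj[c][b]
        del adj[a][b]
        size[a] += size[b]
        del adj[b], size[b]
    return True
```

For ε = 0 the check degenerates to requiring that every merge be a current best merge, i.e., the behavior of the exact greedy algorithm.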

Experimentally, this notion of approximation (say, for ε = 0.1) incurs minimal quality loss compared to the exact algorithm on the same graph. Moreover, the approximate algorithm also yielded large speedups of over 75× over the quadratic-work algorithms, and could scale to inputs with tens of millions of points. However, our implementations were slow for inputs with more than a few hundred million edges, since they were purely sequential.

The next step was to try to design a parallel algorithm with the same work and a provably low number of sequential dependencies (formally, low depth, i.e., a short critical path). In a pair of papers from NeurIPS 2022 and ICALP 2024, we studied how to obtain good parallel algorithms for HAC. First, we confirmed the common wisdom that exact average-linkage HAC is hard to parallelize due to the sequential dependencies between successive greedy merges. Formally, we showed that the problem is as hard to parallelize as any other problem solvable in polynomial time. Thus, barring a breakthrough in complexity theory, exact average-linkage HAC is unlikely to admit fast parallel algorithms.

On the algorithmic side, we developed a parallel approximate HAC algorithm, called ParHAC, which we show is highly scalable and runs in near-linear work and poly-logarithmic depth. ParHAC works by grouping the edges into O(log n) weight classes and processing each class using a carefully designed low-depth symmetry-breaking algorithm. ParHAC enabled the clustering of the WebDataCommons hyperlink graph, one of the largest publicly available graph datasets with over 100 billion edges, in just a few hours on a modest multicore machine.
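
To give a feel for the first ingredient, here is a minimal Python sketch of one way to form geometric weight classes (the exact bucketing rule, the (1+ε) class width, and the function name are our illustrative assumptions rather than ParHAC's precise definition):

```python
# A minimal sketch of grouping similarity edges into geometrically decreasing
# weight classes, so that all edges within one class have weights within a
# (1 + eps) factor of each other.
import math
from collections import defaultdict

def weight_classes(edges, eps=0.1):
    """edges: list of (u, v, w) with w > 0. Returns {i: [edges]}, where class i
    holds edges with w in (w_max / (1+eps)^(i+1), w_max / (1+eps)^i]."""
    w_max = max(w for _, _, w in edges)
    classes = defaultdict(list)
    for u, v, w in edges:
        i = math.floor(math.log(w_max / w, 1 + eps))
        classes[i].append((u, v, w))
    return classes
```

Roughly speaking, because edges within one class have nearly equal weights, many of them can be merged in parallel without violating the (1+ε) guarantee; the low-depth symmetry-breaking step decides which of the simultaneously eligible merges to perform, and the full algorithm also handles the bookkeeping for edges whose weights change as clusters merge.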