(H)DBSCAN

Back to clustering algorithms: a look at a classic, DBSCAN (and its new H-version).


What is DBSCAN?

DBSCAN, i.e. Density-Based Spatial Clustering of Applications with Noise, is a very old algorithm (by today’s standards). First published in 1996, it has since become a classic, for a few reasons:

  • It’s easy to understand
  • It’s actually a true clustering algorithm (as opposed to $k$-means, which is more partitioning than clustering)
  • It generally works (until it doesn’t)

In this subsection we introduce DBSCAN and how to use it, along with its limitations. I’ll be referring to the original paper, which is linked above and is a surprisingly easy read.

In the following subsection we add on the H to get HDBSCAN.

How does DBSCAN work?

The key concept that the authors try to define and pin down is that of “density”. When we look at a set of points, how do we intuitively tell that an area is dense, or sparse? By coming up with a definition of density, they can then use it to define clusters.

Their premise is as follows:

  • An area is dense (and therefore a cluster) when it has many points in a small area.
  • Essentially that means that each point has a lot of neighbours in a small distance.
  • Therefore:
    • Start off by picking min_samples, i.e. how many neighbours must you have to be dense?
    • Continue by picking eps, i.e. how small is that distance we just mentioned?
  • With that, they define a few ideas:
    • an eps-neighbourhood for each point $p$, i.e. the set of points within eps distance of $p$ (and, in particular, how many of them there are)
    • the concept of directly density-reachable (DDR), i.e. if point $q$ has enough eps-neighbours (at least min_samples of them), and $p$ is one of said neighbours, then $p$ is directly density-reachable from $q$. It’s a bit like saying: if you’re the popular kid and I’m close to you, I’m directly density-reachable from you
    • This extends to density-reachable, i.e. if I am DDR from Person A, and Person A is DDR from Person B, then I am density-reachable from Person B. Makes sense in such a connected world: friends of friends of popular people…
    • Now a symmetric concept, density-connected. If you can find one point $o$ from which both $p$ and $q$ are density-reachable, then $p$ and $q$ are density-connected. I.e. if I know of famous person O through some famous friend’s famous friend, and you also know of O through your other famous friend’s famous friend, then we are density-connected.
  • Putting this concept together:
    • A cluster is a set of points that are all pairwise density-connected. Further, every point density-reachable from a point in the cluster is also in the cluster.
    • Extending the famous people analogy:
      • Everyone in said cluster knows some number of famous people in the cluster, and
      • Everyone who can claim they know (through some number of famous friends) at least one famous person in the cluster is in the cluster
    • Makes sense? So if you’re a loner, you’re gonna be discarded, because you don’t know anyone. Or you know the wrong group of friends (the unpopular ones). In DBSCAN, you kinda have to know the Important Points to be in the cluster.
      • Which makes sense! That’s how you trap outliers.
  • Then build clusters based on this concept (algorithm is here, page 4); a code sketch follows after this list
    • Basically, start by picking a point
    • If it’s not already labelled:
      • is it popular/famous/influential (i.e. its eps-neighbourhood contains at least min_samples points)?
        • If not, it’s a fringe member of society. Label it as noise - BUT it could be a fringe member of a cluster
        • Otherwise, it’s a core point of a new cluster. Find its members:
          • All its friends are automatically part of its cluster, even those initially labelled as fringe.
          • Check its friends’ friends to see if they are famous influential points, and repeat until exhausted.
    • Then pick another point and restart the process till no more points are left.
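
To make that walkthrough concrete, here’s a minimal from-scratch sketch of the loop above. This is my own simplified re-implementation for illustration, not the paper’s exact pseudocode, and it uses brute-force neighbourhood queries:

```python
import numpy as np

def region_query(X, i, eps):
    """Return indices of all points within eps of point i (its eps-neighbourhood)."""
    return np.flatnonzero(np.linalg.norm(X - X[i], axis=1) <= eps)

def dbscan(X, eps, min_samples):
    """Brute-force DBSCAN: returns a cluster id per point, -1 for noise."""
    n = len(X)
    labels = np.full(n, -2)              # -2 = unvisited, -1 = noise
    cluster = -1
    for i in range(n):
        if labels[i] != -2:
            continue                     # already claimed by some cluster (or noise)
        neighbours = region_query(X, i, eps)
        if len(neighbours) < min_samples:
            labels[i] = -1               # a fringe member of society, for now
            continue
        cluster += 1                     # i is popular (a core point): new cluster
        labels[i] = cluster
        seeds = list(neighbours)
        while seeds:                     # expand to friends, friends of friends, ...
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster      # adopt a former noise point as a fringe member
            if labels[j] != -2:
                continue
            labels[j] = cluster
            j_neighbours = region_query(X, j, eps)
            if len(j_neighbours) >= min_samples:
                seeds.extend(j_neighbours)  # j is a core point too: keep expanding
    return labels
```

Note how points initially labelled as noise can be adopted later as fringe members when a core point claims them, exactly as in the steps above.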

What are its issues?

Unfortunately, DBSCAN isn’t the fastest algorithm in the world. As described above, it has a runtime complexity of $\Theta(n^2 \cdot d)$*, which does make it slow. There have been computational advances (spatial indexes can speed up the neighbourhood queries), but they only mitigate the problem.

Furthermore, you have to be careful playing with the two parameters, min_samples and eps. Those change things a lot, as this wonderful interactive post by Naftali Harris shows.
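
To get a feel for that sensitivity yourself, here’s a quick sketch using scikit-learn’s DBSCAN (the half-moons dataset and the particular eps values are just illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: a shape k-means famously mangles
X, _ = make_moons(n_samples=500, noise=0.08, random_state=0)

for eps in (0.05, 0.15, 0.5):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

Too small an eps shatters the data into noise and tiny fragments; too large an eps merges the two moons into one blob.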

Lastly, what happens if density varies across the data? The algorithm assumes densities remain above a single fixed bar throughout (the bar being determined by the two parameters), and doesn’t really differentiate between high and not-so-high densities.

Enter the H, to solve these problems!

* Assuming Euclidean distance. Citation here

How does HDBSCAN work?

A few core differences:

  • There’s a preprocessing step, before actually finding neighbourhoods, that directly deals with the varying-density problem
  • Instead of building clusters by iteratively finding networks of famous points, we build a tree that connects all points through low-distance edges (called a minimum spanning tree, or MST)
  • Then, clusters are auto-selected based on how stable they are as stray noise splits off

This ends up quite far from DBSCAN, but the family resemblance is there.
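
As a quick teaser before that walkthrough, typical usage via the hdbscan library looks like this (a sketch; note there’s no eps to pick, and min_cluster_size is the main knob):

```python
import hdbscan
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.08, random_state=0)

# No eps to tune; min_cluster_size is the main parameter
clusterer = hdbscan.HDBSCAN(min_cluster_size=15)
labels = clusterer.fit_predict(X)    # -1 still marks noise, as in DBSCAN
```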

(tbc with code and all)

Written on June 25, 2019