t-SNE

So you’ve tried PCA, hoping to eyeball some clusters. You’ve scaled your data. You’ve tried both 2D and 3D. You’ve tried scaling your data in other ways. You’ve even thrown the Box-Cox transform at it.

And it still looks like a huge blob.

What now? Enter t-SNE, an algorithm created by van der Maaten and Hinton (2008). In this post, we walk through what t-SNE is, how it works, and how (and how not) to use it.


What is t-SNE?

Unfortunately, even after being around for over a decade, t-SNE doesn’t have many good explanations online.

At its core, t-SNE (which stands for t-distributed Stochastic Neighbour Embedding) is a dimensionality-reduction algorithm (i.e. it tries to reduce the number of dimensions/columns) that differs from PCA on a few fronts:

  • PCA performs a linear transformation of the data; t-SNE is nonlinear
  • PCA works by picking a direction that maximizes the variance (aka find the direction that has the biggest variability in datapoints); t-SNE converts the pairwise distances between points (loosely) into probabilities, then looks for a lower-dimensional arrangement that preserves them
    • To elaborate a bit more: PCA only cares about directions of global spread, while t-SNE cares about which points are neighbours of which other points - which is why it tends to be better at showing clusters
  • PCA is deterministic* (no randomness involved); t-SNE is random

* in theory. In practice, depending on the SVD solver you use, you may find differences. (A quick check of the determinism point is sketched below.)
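
To make the determinism point concrete, here’s a quick, hedged check on toy data (sklearn is used here only because it’s the most familiar API; a faster t-SNE package is discussed later):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = np.random.RandomState(0).normal(size=(200, 10))  # toy data

# PCA: same input -> same output, no seed needed (on the same machine/solver)
assert np.allclose(PCA(n_components=2).fit_transform(X),
                   PCA(n_components=2).fit_transform(X))

# t-SNE: results vary run to run unless you fix random_state
emb_a = TSNE(n_components=2, random_state=42).fit_transform(X)
emb_b = TSNE(n_components=2, random_state=42).fit_transform(X)
assert np.allclose(emb_a, emb_b)  # identical only because the seed is fixed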

The original paper (van der Maaten & Hinton, 2008, Journal of Machine Learning Research) is here, if readers are interested.

How does t-SNE work?

The fundamental guiding principle of t-SNE (as with other dimensionality-reduction algorithms) is this: points that are close together in the higher-dimensional space ought to stay close together when mapped into the lower-dimensional space.

Two questions are raised here:

  • what does it mean to be close by?
  • how should the mapping occur?

The first question is answered, interestingly, with probability. If a point $x_j$ is close to another point $x_i$, the authors assign a score to that pair of points using the Gaussian distribution: the Gaussian score between the two points, $\exp\left(-\frac{\lVert x_i - x_j\rVert^2}{2\sigma_i^2}\right)$, divided by the sum of Gaussian scores between $x_i$ and every other point. This is the basis of the joint probability distribution over the higher-dimensional space, $P$.

Intuitively, this means that points far away from you will have a small numerator (and thus a small value), and points nearby will have a large numerator and thus a high value. This is the start of the Stochastic Neighbour Embedding process.

(There are a couple of things the authors modify - due partly to optimization issues - namely how outliers are handled, and symmetrizing the scores so that $x_i$’s score for $x_j$ is the same as $x_j$’s score for $x_i$.)
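
As a rough sketch of those high-dimensional affinities (variable names are mine, and the per-point bandwidths $\sigma_i$ are taken as given here - in the real algorithm they come from the perplexity, discussed later):

import numpy as np

def high_dim_affinities(X, sigmas):
    """X: (n, d) data; sigmas: (n,) per-point bandwidths sigma_i."""
    n = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # ||x_i - x_j||^2
    scores = np.exp(-sq_dists / (2.0 * sigmas[:, None] ** 2))  # Gaussian scores
    np.fill_diagonal(scores, 0.0)                         # a point is not its own neighbour
    p_cond = scores / scores.sum(axis=1, keepdims=True)   # p_{j|i}: each row sums to 1
    P = (p_cond + p_cond.T) / (2.0 * n)                   # symmetrised joint distribution P
    return P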

The key to answering the second question is noting that if a pair of points has a high probability score in the high-dimensional distribution $P$, it should have a “relatively” high score in the low-dimensional distribution $Q$ as well. The most natural metric for checking whether the low-dimensional distribution fits the high-dimensional one is the Kullback–Leibler divergence, one of the OG functions for quantifying how different two distributions are*. So, like any good ML model, they set the KL divergence as the cost/loss function and minimize it! (This makes sense, of course, because you want to minimize the difference between the high-dimensional and low-dimensional distributions.)

(Here’s where gradient descent comes in - to solve for the low-dimensional distribution by minimizing the KL divergence. The catch is that this cost function isn’t convex, so different random initializations can land in different local minima - hence the randomness.)
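
For concreteness, here’s a tiny sketch of that loss - the KL divergence between the pairwise-probability matrices $P$ and $Q$ (both sum to 1 over all pairs):

import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    # sum over all pairs of p_ij * log(p_ij / q_ij); eps guards against log(0)
    return np.sum(P * np.log((P + eps) / (Q + eps)))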


But astute-reader-you may note, “where’s the t in this?” Hold on tight.

There is one key problem with this procedure, which shows up when you try to squeeze points from higher dimensions into lower ones. Imagine a tetrahedron with a point right in the center. How would we map this down to a 2D surface? If we started with the center point, we could easily draw a circle around it and place the 4 corners of the tetrahedron as 4 equidistant points on the circle. But that misrepresents the geometry: a tetrahedron’s 4 corners are all equally far apart from each other, whereas on the circle, points on opposite sides end up further apart than adjacent ones.

There’s no way around that, and notice that in doing so we have also inadvertently bunched these 4 (mutually faraway) points closer together than they should be. This is the essence of the ‘crowding problem’: faraway points get bunched/crushed together. Not a good thing if you’re trying to visualize clusters.
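
A quick numerical version of the tetrahedron example (ignoring the center point): in 3D all 4 corners are equally far apart, but 4 equidistant points on a circle cannot be.

import numpy as np
from itertools import combinations

# Regular tetrahedron (alternate corners of a cube): all pairwise distances are equal
tetra = np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]], dtype=float)
print([round(np.linalg.norm(a - b), 3) for a, b in combinations(tetra, 2)])
# -> six identical distances (all 2.828)

# 4 equidistant points on a unit circle: adjacent pairs are closer than opposite pairs,
# so some pairwise distances are necessarily distorted
circle = np.array([[1, 0], [0, 1], [-1, 0], [0, -1]], dtype=float)
print([round(np.linalg.norm(a - b), 3) for a, b in combinations(circle, 2)])
# -> a mix of 1.414 and 2.0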

So we fix that by relying on our friend Student and his $t$-distribution (or actually, Cauchy)*. Remember from AP Statistics that the $t$-distribution has thicker tails? That’s perfect for this case, because all of those faraway points (i.e. points that would otherwise have been cramped together) now have room to actually sit out in the long tail. So instead of using a Gaussian to represent $Q$ like we did with $P$, we use a $t$-distribution (with 1 degree of freedom) for $Q$!

And there we go, that’s t-SNE for you.

In sum:

  • Define a Gaussian joint probability distribution $P$ over all pairs of points in the higher-dimensional space
  • Define the corresponding joint distribution $Q$ in the lower-dimensional space, using a $t$-distribution with 1 degree of freedom
  • Solve for the mapping by minimizing the KL divergence between $P$ and $Q$ (summed over all pairs of points) using gradient descent (or a relative); a bare-bones sketch of this loop follows below
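
Here is that sketch - toy numpy code, assuming $P$ was computed as in the earlier affinity sketch, and deliberately omitting the momentum, early exaggeration, and other tricks used in real implementations:

import numpy as np

def tsne_sketch(P, n_iter=500, learning_rate=100.0, seed=42):
    """P: (n, n) symmetrised high-dimensional affinities. Returns a 2D embedding."""
    n = P.shape[0]
    Y = np.random.RandomState(seed).normal(scale=1e-4, size=(n, 2))  # random start
    for _ in range(n_iter):
        # Student-t (df=1, i.e. Cauchy) kernel for the low-dimensional affinities Q
        sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
        num = 1.0 / (1.0 + sq_dists)
        np.fill_diagonal(num, 0.0)
        Q = num / num.sum()
        # Gradient of KL(P || Q) w.r.t. each y_i (as given in the paper):
        # 4 * sum_j (p_ij - q_ij) * (y_i - y_j) / (1 + ||y_i - y_j||^2)
        W = (P - Q) * num
        grad = 4.0 * (np.diag(W.sum(axis=1)) - W) @ Y
        Y -= learning_rate * grad                   # plain gradient descent step
    return Y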

There’s one final part I omitted: the parameter.

Remember our $\sigma$ from the first question’s answer? No one mentioned anything about calculating that, naughty. Each $\sigma_i$ is chosen (by binary search, in practice) so that point $i$’s conditional distribution hits a user-specified value of a measure called perplexity, which is defined from the Shannon entropy (perplexity $= 2^{\text{entropy}}$). The higher the perplexity, the higher the “smooth measure of the effective number of neighbors”, in the authors’ words.
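
A hedged sketch of that search (function and variable names are mine):

import numpy as np

def sigma_for_perplexity(sq_dists_i, target_perplexity, n_steps=50):
    """sq_dists_i: squared distances from point i to every *other* point."""
    lo, hi = 1e-10, 1e10
    for _ in range(n_steps):
        sigma = (lo + hi) / 2.0
        p = np.exp(-sq_dists_i / (2.0 * sigma ** 2))
        p /= p.sum()                                  # conditional distribution p_{j|i}
        entropy = -np.sum(p * np.log2(p + 1e-12))     # Shannon entropy, in bits
        if 2.0 ** entropy > target_perplexity:
            hi = sigma    # too many effective neighbours: shrink sigma
        else:
            lo = sigma    # too few: grow sigma
    return sigma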

* The KL divergence, which has its roots in information theory, is also neat because its asymmetry means that the cost of placing two originally close points far apart in the lower-dimensional space is much higher than the other way round. The KL divergence here is $KL(P \| Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$, where $p_{ij}$ and $q_{ij}$ are the pairwise probabilities defined earlier for the high-dimensional and low-dimensional spaces.

* $\text{Cauchy} (0,1) \sim \text{t}(\text{df} = 1)$

How do I use it correctly?

  1. Parameters: the perplexity
    • I highly encourage people to read this link on the perils of perplexity. Changing it can vary the result so much that very different cluster shapes can form, so even though the recommended perplexity is somewhere between 5 and 50, it’d be good to play around with it.
  2. Parameters: Learning rate and iteration
    • Like many methods that rely on gradient descent, the learning rate and number of iterations are both tunable, and should be tuned if you suspect a crappy local minimum or an optimization that stopped before converging. The article above also says, “If you see a t-SNE plot with strange “pinched” shapes, chances are the process was stopped too early”.
  3. Setting seeds:
    • Please. As I mentioned, this is random, so…
  4. How to run it in practice?
    • Note that the best Python package to run this is NOT sklearn.manifold’s TSNE. That thing is slow. Use MulticoreTSNE instead: pip install MulticoreTSNE.
    • If you are dealing with many columns (e.g. 100+), the author of t-SNE recommends doing some other kind of dimensionality reduction first, such as autoencoder embeddings.
    • Code samples for setting this up are here. As a sample:

Code Samples

  • How do you get set up?
pip install MulticoreTSNE

And once you start Python:

from MulticoreTSNE import MulticoreTSNE as TSNE

This will be very similar to the actual scikit-learn API.

  • To use it:
# initialize it:
tsne_clustering = TSNE(n_components=2, perplexity=30,
                       learning_rate=100, n_iter=1000,
                       random_state=42)
# To be clear:
# - n_components - how many dimensions / columns do you want 
#    the reduced-dimension data in?
# - perplexity - as described earlier, somewhat similar to 
#    number of neighbours
# - learning_rate: typically between 10.0 and 1000.0, according
#    to the docs. this follows similar principles to any gradient
#    descent method
# - n_iter: number of iterations of gradient descent. Minimum
#    is 250. If the data doesn't seem to make sense 
#    it may help to increase n_iter
# - random_state: seed  

# NOTE: unlike most sklearn estimators, t-SNE doesn't learn a reusable
# mapping you can apply to new data, so there is no separate transform()
# step for new points - just use fit_transform:
low_dim_df = tsne_clustering.fit_transform(orig_df)

# With the same parameters and seed (and a single thread), a fresh
# run gives back the same embedding:
import numpy as np
assert np.allclose(low_dim_df,
                   TSNE(n_components=2, perplexity=30,
                        learning_rate=100, n_iter=1000,
                        random_state=42).fit_transform(orig_df))
  • To save and load it:
import joblib # everyone's favourite serialization lib
# saving (pass the filename directly; joblib handles opening the file)
joblib.dump(tsne_clustering, 'FILE_NAME.joblib')

# loading
tsne_clustering = joblib.load('FILE_NAME.joblib')
Written on June 20, 2019