
Why is t-SNE not used as a dimensionality reduction technique for clustering or classification?

  • April 12, 2018

In a recent assignment, we were told to use PCA on the MNIST digits to reduce the dimensions from 64 (8 x 8 images) to 2. We then had to cluster the digits using a Gaussian Mixture Model. PCA using only 2 principal components does not yield distinct clusters and as a result the model is not able to produce useful groupings.
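A minimal sketch of that pipeline, assuming scikit-learn's 8 x 8 digits dataset (the specific GMM settings here are illustrative, not from the assignment):

```python
# Hedged sketch: PCA down to 2 components on the 8x8 digits, then a GMM.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

X, y = load_digits(return_X_y=True)              # 1797 samples, 64 features
X_pca = PCA(n_components=2).fit_transform(X)     # 64 -> 2 dimensions

gmm = GaussianMixture(n_components=10, random_state=0).fit(X_pca)
labels = gmm.predict(X_pca)

# Adjusted Rand index against the true digit labels; with only 2
# principal components the clusters overlap heavily, so agreement
# with the true digits tends to be poor.
print(adjusted_rand_score(y, labels))
```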

However, using t-SNE with 2 components, the clusters are much better separated. The Gaussian Mixture Model produces more distinct clusters when applied to the t-SNE components.
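The same pipeline with t-SNE in place of PCA can be sketched as follows (again a hedged sketch; the default perplexity and the random seed are assumptions):

```python
# Hedged sketch: identical pipeline, but embedding with t-SNE instead of PCA.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

X, y = load_digits(return_X_y=True)
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

labels = GaussianMixture(n_components=10, random_state=0).fit_predict(X_tsne)

# The t-SNE embedding usually shows clearly separated blobs, so the
# agreement with the true digit labels is typically much higher than
# with 2-component PCA.
print(adjusted_rand_score(y, labels))
```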

The difference between PCA with 2 components and t-SNE with 2 components can be seen in the following pair of images, where both transformations have been applied to the MNIST dataset.

PCA on MNIST

t-SNE on MNIST

I have read that t-SNE is only used for visualization of high dimensional data, such as in this answer, yet given the distinct clusters it produces, why is it not used as a dimensionality reduction technique that is then used for classification models or as a standalone clustering method?

The main reason that t-SNE is not used in classification models is that it does not learn a function from the original space to the new (lower-dimensional) one. As such, when we try to use our classifier on new / unseen data, we will not be able to map / pre-process these new data according to the previous t-SNE results.
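This limitation is visible directly in scikit-learn's API: `PCA` exposes a `transform` method for new data, while `TSNE` only offers `fit_transform` and has no `transform` at all. A small sketch of the contrast:

```python
# Hedged sketch of the limitation: a fitted PCA can embed new points,
# a fitted t-SNE cannot, because no mapping function was learned.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)
X_train, X_new = X[:1500], X[1500:]        # hold some points out as "unseen"

pca = PCA(n_components=2).fit(X_train)
X_new_pca = pca.transform(X_new)           # fine: PCA learned a linear map

tsne = TSNE(n_components=2, random_state=0)
X_train_tsne = tsne.fit_transform(X_train)
print(hasattr(tsne, "transform"))          # TSNE provides no transform method
```

So the held-out `X_new` can be projected with PCA, but there is no principled way to place it into the existing t-SNE embedding.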

There is work on training a deep neural network to approximate t-SNE results (e.g., the “parametric t-SNE” paper), but this work has been superseded in part by the existence of (deep) autoencoders. Autoencoders are starting to be used as inputs / pre-processors to classifiers (especially DNNs) exactly because they perform very well in training and also generalise naturally to new data.

t-SNE can potentially be used if we use a non-distance-based clustering technique like FMM (Finite Mixture Models) or DBSCAN (density-based models). As you correctly note, in such cases the t-SNE output can be quite helpful. The issue in these use cases is that some people might try to read into the cluster placement and not only the cluster membership. Since the global distances are lost, drawing conclusions from cluster placement can lead to bogus insights. Notice that merely saying “hey, we found that all the 1s cluster together” offers little value if we cannot say what they are far from. If we just wanted to find the 1s, we might as well have used classification to begin with (which brings us back to the use of autoencoders).
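A hedged sketch of the density-based route: DBSCAN run on a t-SNE embedding of the digits. The `eps` and `min_samples` values here are illustrative assumptions and would normally need tuning per dataset:

```python
# Hedged sketch: density-based clustering (DBSCAN) on a t-SNE embedding.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

X, y = load_digits(return_X_y=True)
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

# eps / min_samples are assumed values, chosen for a 2-D t-SNE embedding.
labels = DBSCAN(eps=3.0, min_samples=10).fit_predict(X_tsne)

# Cluster *membership* is meaningful here; distances *between* clusters
# in the embedding are not, so avoid reading into cluster placement.
n_clusters = len(set(labels) - {-1})       # -1 marks noise points
print(n_clusters)
```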

Source: https://stats.stackexchange.com/questions/340175
