
Why is t-SNE not used as a dimensionality reduction technique for clustering or classification?

  • April 12, 2018

In a recent assignment, we were told to use PCA on the MNIST digits to reduce the dimensions from 64 (8 x 8 images) to 2. We then had to cluster the digits using a Gaussian Mixture Model. PCA using only 2 principal components does not yield distinct clusters and as a result the model is not able to produce useful groupings.
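A minimal sketch of that pipeline, assuming scikit-learn's 8 x 8 digits dataset (the specific GMM settings here are illustrative, not from the assignment):

```python
# Hedged sketch: PCA down to 2 components on the 8x8 digits, then a GMM.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

X, y = load_digits(return_X_y=True)              # 1797 samples, 64 features
X_pca = PCA(n_components=2).fit_transform(X)     # 64 -> 2 dimensions

gmm = GaussianMixture(n_components=10, random_state=0).fit(X_pca)
labels = gmm.predict(X_pca)

# Adjusted Rand index against the true digit labels; with only 2
# principal components the clusters overlap heavily, so agreement
# with the true digits tends to be poor.
print(adjusted_rand_score(y, labels))
```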

However, using t-SNE with 2 components, the clusters are much better separated. The Gaussian Mixture Model produces more distinct clusters when applied to the t-SNE components.
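The same pipeline with t-SNE in place of PCA can be sketched as follows (again a hedged sketch; the default perplexity and the random seed are assumptions):

```python
# Hedged sketch: identical pipeline, but embedding with t-SNE instead of PCA.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

X, y = load_digits(return_X_y=True)
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

labels = GaussianMixture(n_components=10, random_state=0).fit_predict(X_tsne)

# The t-SNE embedding usually shows clearly separated blobs, so the
# agreement with the true digit labels is typically much higher than
# with 2-component PCA.
print(adjusted_rand_score(y, labels))
```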

The difference between PCA with 2 components and t-SNE with 2 components can be seen in the following pair of images, where both transformations have been applied to the MNIST dataset.

PCA on MNIST

t-SNE on MNIST

I have read that t-SNE is only used for visualization of high dimensional data, such as in this answer, yet given the distinct clusters it produces, why is it not used as a dimensionality reduction technique that is then used for classification models or as a standalone clustering method?

The main reason that t-SNE is not used in classification models is that it does not learn a function from the original space to the new (lower-dimensional) one. As such, when we try to use our classifier on new / unseen data, we will not be able to map / pre-process these new data according to the previous t-SNE results.
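This limitation is visible directly in scikit-learn's API: `PCA` exposes a `transform` method for new data, while `TSNE` only offers `fit_transform` and has no `transform` at all. A small sketch of the contrast:

```python
# Hedged sketch of the limitation: a fitted PCA can embed new points,
# a fitted t-SNE cannot, because no mapping function was learned.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)
X_train, X_new = X[:1500], X[1500:]        # hold some points out as "unseen"

pca = PCA(n_components=2).fit(X_train)
X_new_pca = pca.transform(X_new)           # fine: PCA learned a linear map

tsne = TSNE(n_components=2, random_state=0)
X_train_tsne = tsne.fit_transform(X_train)
print(hasattr(tsne, "transform"))          # TSNE provides no transform method
```

So the held-out `X_new` can be projected with PCA, but there is no principled way to place it into the existing t-SNE embedding.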

There is work on training a deep neural network to approximate t-SNE results (e.g., the “parametric t-SNE” paper), but this work has been superseded in part by the existence of (deep) autoencoders. Autoencoders are starting to be used as inputs / pre-processors to classifiers (especially DNNs) exactly because they perform very well in training and also generalise naturally to new data.

t-SNE can potentially be used if we use a non-distance-based clustering technique like FMM (Finite Mixture Models) or DBSCAN (density-based models). As you correctly note, in such cases the t-SNE output can be quite helpful. The issue in these use cases is that some people might try to read into the cluster placement and not only the cluster membership. Since the global distances are lost, drawing conclusions from cluster placement can lead to bogus insights. Notice that merely saying “hey, we found that all the 1s cluster together” offers little value if we cannot say what they are far from. If we just wanted to find the 1s, we might as well have used classification to begin with (which brings us back to the use of autoencoders).
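A hedged sketch of the density-based route: DBSCAN run on a t-SNE embedding of the digits. The `eps` and `min_samples` values here are illustrative assumptions and would normally need tuning per dataset:

```python
# Hedged sketch: density-based clustering (DBSCAN) on a t-SNE embedding.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

X, y = load_digits(return_X_y=True)
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

# eps / min_samples are assumed values, chosen for a 2-D t-SNE embedding.
labels = DBSCAN(eps=3.0, min_samples=10).fit_predict(X_tsne)

# Cluster *membership* is meaningful here; distances *between* clusters
# in the embedding are not, so avoid reading into cluster placement.
n_clusters = len(set(labels) - {-1})       # -1 marks noise points
print(n_clusters)
```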

Source: https://stats.stackexchange.com/questions/340175
