What is t-SNE?

T-distributed stochastic neighbor embedding (t-SNE) is a machine learning technique that helps us see and understand data better. It was introduced by Laurens van der Maaten and Geoffrey Hinton in 2008. The algorithm turns high-dimensional data into a simpler picture, usually in 2D or 3D. The goal is to make the data visually simple while keeping the important relationships between the points.

How does it work?

The algorithm starts by converting the similarities between pairs of high-dimensional data points into probabilities: points that are close together get a high probability of being picked as neighbors, while distant points get a low one. Each data point is then given a location on a low-dimensional map, and a second set of neighbor probabilities is defined for these map points. t-SNE moves the map points around until the two sets of probabilities match as closely as possible (formally, it minimizes the Kullback–Leibler divergence between them). This way, we can look at the data in a much simpler form and still see the important groups or patterns.
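For readers who want the formal version, these neighbor probabilities and the matching objective can be written as follows, in the notation of the original 2008 paper (here x_i are the original points, y_i their low-dimensional map points, N the number of points, and sigma_i a per-point bandwidth chosen to match a user-specified perplexity):

p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}

q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

C = \mathrm{KL}(P \parallel Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}

The map points y_i are found by minimizing C with gradient descent. The heavy-tailed Student's t-distribution used in q_{ij}, which gives the method its name, keeps moderately dissimilar points from being squeezed together on the map.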

Uses

t-SNE is particularly useful for visualizing high-dimensional data such as images, text, and audio. It is used in many fields, including natural language processing, computer vision, and bioinformatics.

Example

The following example uses the Iris dataset, which is available in scikit-learn. It applies t-SNE to reduce the data’s dimensionality to two components and then plots the result, with a different color representing each class of the Iris dataset.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Standardize the feature matrix
X_std = StandardScaler().fit_transform(X)
# Apply t-SNE to reduce the data to two components
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_std)
# Plot the results
plt.figure(figsize=(8, 6))
# Scatter plot with different colors for each class
for i in range(len(np.unique(y))):
    plt.scatter(X_tsne[y == i, 0], X_tsne[y == i, 1], label=f'Class {i}')
plt.title('t-SNE Visualization of Iris Dataset')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.legend()
plt.show()

Explanation

  • Lines 1–5: We import the required libraries.

  • Line 7: We load the Iris dataset, which is bundled with the sklearn library.

  • Lines 8–9: We assign the feature matrix to X and the class labels to y.

  • Line 11: We standardize the feature matrix to have zero mean and unit variance.

  • Line 13: We create a t-SNE model with two components (dimensions) in the lower-dimensional space. The random_state parameter ensures the reproducibility of the results.

  • Line 14: We fit the t-SNE model to the standardized data (X_std) and transform it into the lower-dimensional space (X_tsne).

  • Lines 16–19: We create a figure and plot the data points in the lower-dimensional space (X_tsne), drawing one scatter series per class label (y) so that each class gets its own color.

  • Lines 20–24: We add a title, axis labels, and a legend, and then display the plot. The results can vary with different random seeds and perplexity values, as shown in the sketch after this list.
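To make that last point concrete, here is a minimal sketch that re-runs t-SNE on the same standardized Iris data with a few different perplexity values and plots the embeddings side by side. The perplexity values 5, 30, and 50 are arbitrary choices for illustration, not recommendations:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Load and standardize the Iris data, as in the example above
iris = datasets.load_iris()
X_std = StandardScaler().fit_transform(iris.data)
y = iris.target

# Arbitrary perplexity values chosen for illustration
perplexities = [5, 30, 50]
fig, axes = plt.subplots(1, len(perplexities), figsize=(15, 4))

for ax, perplexity in zip(axes, perplexities):
    # Re-run t-SNE with a different perplexity each time
    embedding = TSNE(n_components=2, perplexity=perplexity,
                     random_state=42).fit_transform(X_std)
    for i in np.unique(y):
        ax.scatter(embedding[y == i, 0], embedding[y == i, 1], label=f'Class {i}')
    ax.set_title(f'perplexity = {perplexity}')

axes[0].legend()
plt.show()

Comparing the panels makes it easy to see how strongly the shape of the embedding depends on these settings, which is why t-SNE plots should be read for cluster structure rather than for exact distances.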
