A. What defines similarity?

To find similarities between data observations, we first need a way to measure similarity. One of the most common measures is cosine similarity.

A data observation with numeric features is essentially a vector of real numbers. Cosine similarity is a standard similarity metric for real-valued vectors, so it is a natural choice for data observations. The cosine similarity of two data observations is the cosine of the angle between their feature vectors: the dot product of the two vectors divided by the product of their lengths. It is always a number between -1 and 1. Because it depends only on the direction of the vectors and not on their magnitude, it measures how proportional the feature values of the two observations are (i.e. whether the ratios between feature columns match).
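
As a rough illustration of the idea above, the sketch below computes cosine similarity directly with NumPy. The helper name `cosine_similarity` and the three observations `a`, `b`, and `c` are made up for this example, and the function assumes neither vector is all zeros.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    # Dot product divided by the product of the vector lengths.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical observations with three numeric features each.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same feature ratios as a, twice the magnitude
c = [3.0, 2.0, 1.0]  # different feature ratios from a

sim_ab = cosine_similarity(a, b)  # approximately 1: proportional features
sim_ac = cosine_similarity(a, c)  # 10/14, noticeably lower than sim_ab
print(sim_ab, sim_ac)
```

Note that `b` is twice as large as `a` in every feature, yet the two still count as maximally similar, since only the ratios between features matter.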

Cosine similarity values closer to 1 represent greater similarity between the observations, while values closer to -1 represent greater divergence. A value of 0 means that the two feature vectors are orthogonal, so the observations are neither similar nor dissimilar.
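
The three reference points (1, -1, and 0) can be checked with a small sketch. The vectors here are hypothetical two-feature observations chosen so that each case is easy to verify by hand, and the helper is an illustrative assumption rather than a library function.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero 1-D vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

base = [1.0, 2.0]

same = cosine_similarity(base, [3.0, 6.0])         # same direction: close to 1
opposite = cosine_similarity(base, [-1.0, -2.0])   # opposite direction: close to -1
orthogonal = cosine_similarity(base, [2.0, -1.0])  # dot product is 0: similarity 0

print(same, opposite, orthogonal)
```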

Cosine Similarity

B. Calculating cosine similarity