Introduction to Word Embedding
In this lesson, we will discuss word embeddings and their types.
What is word embedding?
If you have some experience with NLP, you have probably created vectors for text, that is, converted textual data to numbers, using the two most common techniques: TF-IDF (Term Frequency-Inverse Document Frequency) and CountVectorizer. Let’s look closely at these two techniques.
TF-IDF
TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. The TF-IDF score of a word in a document is the product of two metrics:
- The term frequency of the word in the document. There are several ways to calculate this, the simplest being a raw count of how many times the word appears in the document. The frequency can then be adjusted, for example by the length of the document or by the raw frequency of the most frequent word in the document.
- The inverse document frequency of the word across the set of documents. This measures how common or rare the word is in the whole collection; it is typically computed as the logarithm of the total number of documents divided by the number of documents that contain the word. Words that appear in every document (such as "the") get an inverse document frequency of zero.
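The two metrics above can be multiplied together in a few lines of plain Python. This is a hand-rolled sketch using the raw-count term frequency and the standard log(N / df) inverse document frequency; the tiny corpus and helper names are illustrative, not part of any library:

```python
import math

# Illustrative corpus: three short documents, pre-tokenized by whitespace.
docs = [
    "the sky is blue".split(),
    "the sun is bright".split(),
    "the sun in the sky is bright".split(),
]

def tf(term, doc):
    # Term frequency: raw count of the term in the document.
    return doc.count(term)

def idf(term, docs):
    # Inverse document frequency: log of (total docs / docs containing term).
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    # TF-IDF: the product of the two metrics.
    return tf(term, doc) * idf(term, docs)

# "the" appears in every document, so its IDF (and hence TF-IDF) is 0.
print(tfidf("the", docs[0], docs))  # 0.0
# "blue" appears only in the first document, so it scores highly there.
print(round(tfidf("blue", docs[0], docs), 3))
```

Note how the multiplication captures the intuition: a word scores highly for a document only if it is frequent in that document *and* rare across the collection.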