Word Embeddings

Introduction to word embeddings

Word embeddings are used in nearly all modern natural language processing models. Before we delve into fairness in textual data, it is worth examining word vectors and their properties, since they play a central role in analyzing fairness.

Every type of data must be converted into numbers before it can be fed into a model. Numerical features require little work. Categorical features are handled with various encoding schemes. Images are represented by their red, green, and blue pixel values. For text, we transform each token into a fixed-size vector, which becomes the model’s input.
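To make this concrete, here is a small sketch of how each kind of data ends up as numbers. The shapes and values are purely illustrative and only assume NumPy; they are not tied to any particular dataset or library used later in the lesson.

```python
# Illustrative sketch: how different data types become numeric arrays.
import numpy as np

age = np.array([34.0])               # numerical feature: used (almost) as-is
color = np.array([0, 0, 1])          # categorical feature: one-hot encoded ("blue" out of red/green/blue)
image = np.zeros((28, 28, 3))        # image: a grid of red, green, and blue pixel values
token_vector = np.random.rand(300)   # text token: a fixed-size vector (here 300-dimensional)

print(age.shape, color.shape, image.shape, token_vector.shape)
```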

Converting tokens into vectors can be achieved in various ways, and the choice of method significantly impacts model performance. Because mapping tokens to good vectors is hard, we often rely on pretrained models. Training them requires vast amounts of text data and computing power, which makes building them from scratch impractical. In this lesson, we will focus on two specific sources of vectors: GloVe and BERT.
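The sketch below shows the pretrained route using GloVe vectors loaded through gensim’s downloader. The model name "glove-wiki-gigaword-300" is one of the vector sets gensim distributes; the exact names and setup are assumptions of this sketch rather than something the lesson prescribes.

```python
# A minimal sketch, assuming gensim is installed and its downloader can
# fetch the pretrained "glove-wiki-gigaword-300" vectors.
import gensim.downloader as api

# Download (on first use) and load 300-dimensional GloVe vectors.
glove = api.load("glove-wiki-gigaword-300")

vector = glove["language"]        # look up the static vector for a word
print(vector.shape)               # (300,) -- a fixed-size 1D vector
print(glove.most_similar("language", topn=3))  # nearest neighbors in embedding space
```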

Anatomy of a word vector

Let’s look at the key properties of a word vector in more detail, since we need them to understand potential fairness issues.

  • The first property is the vector’s size. Larger vectors can store more semantic information, though they require more memory and take longer to train. Typical word embedding sizes are 300 for GloVe and 768 for BERT, so each word is represented by 300 or 768 numerical values, respectively, stored as a 1D vector.

  • The second property is whether the vector is static or contextual. A static vector assigns a word the same representation regardless of context. For instance, the word “crane” receives a single vector whether it refers to a bird or a construction machine. A contextual vector, in contrast, changes based on the surrounding words, allowing the embedding to distinguish the two meanings. GloVe provides static vectors, while BERT offers contextual ones; a small sketch contrasting them appears after this list.

  • The amount of training data is often indicated in the model name. For example, “glove6B” means the model was trained on a corpus of 6 billion tokens.

  • Each model has a specific vocabulary size (the number of words it can represent). A bigger vocabulary means a larger model, but also better language coverage, since more words get their own representation.
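The sketch below contrasts static and contextual embeddings using the “crane” example and also prints the model’s vocabulary size. It assumes the Hugging Face transformers library and the public "bert-base-uncased" checkpoint; the helper function crane_embedding is hypothetical, written only for this illustration.

```python
# A minimal sketch, assuming torch and transformers are installed and the
# public "bert-base-uncased" checkpoint is available.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def crane_embedding(sentence):
    """Return the contextual embedding of the token 'crane' in a sentence (hypothetical helper)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("crane")]                       # the 768-d vector for "crane"

bird = crane_embedding("A crane waded through the shallow water.")
machine = crane_embedding("The crane lifted the steel beam onto the roof.")

print(bird.shape)                                      # torch.Size([768])
print(torch.cosine_similarity(bird, machine, dim=0))   # below 1.0: the vectors differ with context
print(tokenizer.vocab_size)                            # size of BERT's vocabulary
```

Running this shows that the two “crane” embeddings are 768-dimensional and not identical, which is exactly the static-versus-contextual distinction described above; a GloVe lookup, by contrast, would return the same fixed 300-dimensional vector in both sentences.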
