Word Embeddings

Introduction to word embeddings

Word embeddings are used in nearly all modern natural language processing models. Before we delve into fairness in textual data, it is worth examining word vectors and their properties, since they play a central role in analyzing fairness.

Every type of data must be converted into numbers before it can be fed into a model. Numerical features require little work. Categorical features are handled with various encoding schemes. Images are represented by their red, green, and blue pixel values. For text, we transform each token into a fixed-size vector, which becomes the model’s input.
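To make this concrete, here is a small sketch of how each kind of data ends up as numbers. The shapes and values are purely illustrative and only assume NumPy; they are not tied to any particular dataset or library used later in the lesson.

```python
# Illustrative sketch: how different data types become numeric arrays.
import numpy as np

age = np.array([34.0])               # numerical feature: used (almost) as-is
color = np.array([0, 0, 1])          # categorical feature: one-hot encoded ("blue" out of red/green/blue)
image = np.zeros((28, 28, 3))        # image: a grid of red, green, and blue pixel values
token_vector = np.random.rand(300)   # text token: a fixed-size vector (here 300-dimensional)

print(age.shape, color.shape, image.shape, token_vector.shape)
```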

Converting tokens into vectors can be achieved in various ways, and the choice of method significantly impacts model performance. Because mapping tokens to good vectors is hard, we often rely on pretrained models. Training them requires vast amounts of text data and computing power, which makes building them from scratch impractical. In this lesson, we will focus on two specific sources of vectors: GloVe and BERT.
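The sketch below shows the pretrained route using GloVe vectors loaded through gensim’s downloader. The model name "glove-wiki-gigaword-300" is one of the vector sets gensim distributes; the exact names and setup are assumptions of this sketch rather than something the lesson prescribes.

```python
# A minimal sketch, assuming gensim is installed and its downloader can
# fetch the pretrained "glove-wiki-gigaword-300" vectors.
import gensim.downloader as api

# Download (on first use) and load 300-dimensional GloVe vectors.
glove = api.load("glove-wiki-gigaword-300")

vector = glove["language"]        # look up the static vector for a word
print(vector.shape)               # (300,) -- a fixed-size 1D vector
print(glove.most_similar("language", topn=3))  # nearest neighbors in embedding space
```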

Anatomy of a word vector

Let’s look at the key properties of a word vector in more detail, since we need them to understand potential fairness issues.

  • The first property is the vector’s size. Larger vectors can store more semantic information, though they require more memory and take longer to train. Typical word embedding sizes are 300 for GloVe and 768 for BERT, so each word is represented by 300 or 768 numerical values, respectively, stored as a 1D vector.

  • The second property is whether the vector is static or contextual. A static vector assigns a word the same representation regardless of context. For instance, the word “crane” receives a single vector whether it refers to a bird or a construction machine. A contextual vector, in contrast, changes based on the surrounding words, allowing the embedding to distinguish the two meanings. GloVe provides static vectors, while BERT offers contextual ones; a small sketch contrasting them appears after this list.

  • The amount of training data is often indicated in the model name. For example, “glove6B” means the model was trained on a corpus of 6 billion tokens.

  • Each model has a specific vocabulary size (the number of words it can represent). A bigger vocabulary means a larger model, but also better language coverage, since more words get their own representation.
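The sketch below contrasts static and contextual embeddings using the “crane” example and also prints the model’s vocabulary size. It assumes the Hugging Face transformers library and the public "bert-base-uncased" checkpoint; the helper function crane_embedding is hypothetical, written only for this illustration.

```python
# A minimal sketch, assuming torch and transformers are installed and the
# public "bert-base-uncased" checkpoint is available.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def crane_embedding(sentence):
    """Return the contextual embedding of the token 'crane' in a sentence (hypothetical helper)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("crane")]                       # the 768-d vector for "crane"

bird = crane_embedding("A crane waded through the shallow water.")
machine = crane_embedding("The crane lifted the steel beam onto the roof.")

print(bird.shape)                                      # torch.Size([768])
print(torch.cosine_similarity(bird, machine, dim=0))   # below 1.0: the vectors differ with context
print(tokenizer.vocab_size)                            # size of BERT's vocabulary
```

Running this shows that the two “crane” embeddings are 768-dimensional and not identical, which is exactly the static-versus-contextual distinction described above; a GloVe lookup, by contrast, would return the same fixed 300-dimensional vector in both sentences.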
