Dataset

In our complete example at the end of the lesson, we’ll use a job title and description dataset (https://www.kaggle.com/datasets/kshitizregmi/jobs-and-job-description) to generate text embeddings and perform a semantic search to find jobs matching a given query.

We’ll take a methodical approach to the example: we’ll start by explaining the embedding model, then walk through generating word and sentence/document embeddings on simple text examples, and finally apply what we’ve learned to our “job title and description” dataset to build a small job-finding application.
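Before we get there, it helps to see what “semantic search” boils down to once embeddings exist: ranking items by the similarity of their vectors to the query’s vector. The sketch below is illustrative only; the random vectors stand in for real BERT embeddings, and the full, working example comes at the end of the lesson.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder embeddings: three "job description" vectors and one "query" vector.
# In the real example, these will be 768-dimensional BERT embeddings.
rng = np.random.default_rng(0)
job_embeddings = rng.normal(size=(3, 768))
query_embedding = rng.normal(size=768)

# Semantic search: rank jobs by cosine similarity to the query, highest first.
scores = [cosine_similarity(query_embedding, job) for job in job_embeddings]
ranking = np.argsort(scores)[::-1]
print("Jobs ranked by relevance to the query:", ranking.tolist())
```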

Embedding model: BERT

BERT is a popular choice for generating word, sentence, and document embeddings.

BERT (Bidirectional Encoder Representations from Transformers) is a state-of-the-art natural language processing model developed by Google. It’s designed to capture the bidirectional context of words in a text corpus by pretraining on large amounts of unlabeled text data. Key components of the BERT embedding model include the following:

  • Tokenization: BERT tokenizes input text into subword tokens using WordPiece tokenization. This allows BERT to handle out-of-vocabulary words and capture morphological variations (see the tokenization sketch after this list).

  • Transformer architecture: BERT utilizes a transformer architecture consisting of multiple layers of self-attention mechanisms and feedforward neural networks. This architecture enables BERT to capture contextual information from the left and right contexts of each word in a sentence.

  • Pretraining: BERT is pretrained on large text corpora using two unsupervised learning tasks: masked language model (MLM) and next sentence prediction (NSP). In MLM, BERT predicts masked words in a sentence based on the context of the surrounding words. In NSP, BERT predicts whether two sentences appear consecutively in the original text.

  • Embedding generation: During pretraining, BERT learns contextualized embeddings for each token in the input text. These embeddings capture the meaning of individual words and their relationships with surrounding words in the context of a sentence.

  • Fine-tuning: After pretraining, BERT can be fine-tuned on downstream tasks such as text classification, named entity recognition, and sentiment analysis. Fine-tuning adapts BERT’s parameters to the specific task, improving its performance.
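To make WordPiece tokenization concrete, here is a minimal sketch using the Hugging Face transformers library (an assumption of this sketch; the lesson’s own code may load the tokenizer differently). It shows how a sentence is split into subword tokens and mapped to vocabulary IDs, with the special [CLS] and [SEP] tokens added.

```python
from transformers import BertTokenizer

# Load the WordPiece tokenizer that ships with bert-base-uncased.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Rare words are split into '##'-prefixed subword pieces instead of being
# mapped to an unknown token, which is how BERT handles out-of-vocabulary words.
tokens = tokenizer.tokenize("BERT generates contextual embeddings")
print(tokens)

# encode() adds the special [CLS] and [SEP] tokens and maps every token
# to its integer ID in the WordPiece vocabulary.
token_ids = tokenizer.encode("BERT generates contextual embeddings")
print(token_ids)
print(tokenizer.convert_ids_to_tokens(token_ids))
```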

Pretrained BERT model (bert-base-uncased)

The bert-base-uncased model is a pretrained version of BERT developed by Google. It is a powerful tool for generating high-quality embeddings that capture contextual information and semantic meaning in natural language text. We’ll use it to generate word, sentence, and document embeddings.

  • Model size: bert-base-uncased refers to the base version of BERT, which consists of 12 transformer layers, 768 hidden units (dimensions) in each layer, and 110 million parameters in total. It’s a relatively smaller version compared to larger variants like bert-large.

  • Uncased variant: The “uncased” variant indicates that the model is trained on uncased text, where all text is converted to lowercase during tokenization. This variant is suitable for tasks where case sensitivity is not crucial.

  • Pretrained weights: The bert-base-uncased model has pretrained weights learned from large-scale text corpora, such as Wikipedia and BookCorpus. These weights capture rich semantic information and contextual relationships between words in natural language text.

  • Usage: Once loaded, the pretrained bert-base-uncased model can be fine-tuned on downstream tasks using task-specific datasets, or used as a feature extractor to generate word, sentence, or document embeddings for various natural language processing tasks (see the sketch after this list).
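As a quick illustration of the feature-extractor usage, here is a minimal sketch with the Hugging Face transformers and PyTorch libraries (assumed to be installed; the lesson’s own setup may differ). It loads bert-base-uncased and produces one 768-dimensional contextual embedding per input token.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()  # inference only; no fine-tuning here

# An illustrative input sentence (any short text works).
inputs = tokenizer("Data scientist with Python experience", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch_size, sequence_length, 768):
# one contextual embedding per token, including [CLS] and [SEP].
print(outputs.last_hidden_state.shape)
```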

Generating word embeddings with BERT

To understand how we generate word embeddings with BERT, let’s start with two short text sequences as an example before working with large datasets. Our task is to generate embeddings for each word in both sequences.
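As a preview of what that looks like in code, here is a minimal sketch; the two sequences below are illustrative stand-ins and not necessarily the ones used in the rest of the lesson. Because BERT’s embeddings are contextual, the word “bank” receives a different vector in each sentence.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

# Two illustrative sequences; "bank" means something different in each.
sequences = ["The bank approved my loan.", "We sat on the river bank."]
encoded = tokenizer(sequences, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoded)

# Every token in every sequence gets its own 768-dimensional embedding.
for i, seq in enumerate(sequences):
    tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"][i].tolist())
    print(seq)
    for token, vector in zip(tokens, outputs.last_hidden_state[i]):
        # [CLS], [SEP], and any [PAD] tokens also receive vectors.
        print(f"  {token:>10} -> first 3 dims: {vector[:3].tolist()}")
```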
