Generating and Storing Embeddings in ChromaDB Using BERT

Learn how to use BERT to generate and store embeddings of words in ChromaDB.

Dataset

In our complete example at the end of the chapter, we will use a movie dataset (https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset) to generate text embeddings and perform a semantic search to find movies matching a given query. To keep the code fast, we create embeddings only for the first fifty movie descriptions in the dataset, together with the "genre" and "title" columns.
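As a rough illustration of limiting the data to the first rows and a few columns, here is a minimal sketch using only the standard library. The inline CSV text, the `load_first_rows` helper, and the column name "description" are hypothetical stand-ins; the real dataset file is much larger and its column names may differ.

```python
import csv
import io
from itertools import islice

# Hypothetical stand-in for the dataset file. The real CSV is much larger,
# and its exact column names may differ from the ones assumed here.
SAMPLE_CSV = """title,genre,description
Titanic,Romance,A love story aboard the doomed RMS Titanic.
Toy Story,Animation,Toys come to life when their owner is away.
Heat,Crime,A detective pursues a crew of professional thieves.
"""


def load_first_rows(handle, limit=50, columns=("title", "genre", "description")):
    """Return up to `limit` rows, keeping only the requested columns."""
    reader = csv.DictReader(handle)
    # islice stops reading after `limit` rows, so the rest of the file is never parsed.
    return [{col: row[col] for col in columns} for row in islice(reader, limit)]


# Keep only the first two rows of the toy data.
movies = load_first_rows(io.StringIO(SAMPLE_CSV), limit=2)
print([movie["title"] for movie in movies])
```

With the default `limit=50`, the same helper would return the first fifty rows of a larger file, which is the subset size used in this lesson.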

Generating text embeddings

To understand how we generate word embeddings with BERT, let’s start with two short text sequences as an example before working with large datasets.

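# Sample data: two short text sequences.
# The comma between the strings matters: without it, Python silently
# concatenates adjacent string literals into a single sequence.
movie_info = [
    "Titanic is a 1997 American epic romantic disaster film directed, written, co-produced, and co-edited by James Cameron.",
    "Incorporating both historical and fictionalized aspects, it is based on accounts of the sinking of RMS Titanic in 1912.",
]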

Our task is to generate embeddings for each word in both sequences.

Step 1: Data preprocessing

The first step is to preprocess the text. Preprocessing involves tokenization (converting a text sequence into individual words, called tokens), lemmatization (reducing words to their base or root form, known as a lemma; unlike stemming, which simply cuts off the ends of words and often produces non-words, lemmatization considers the context and converts each word to its meaningful base form) ...
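To make the difference between tokenization, stemming, and lemmatization concrete, here is a toy sketch using only the standard library. It is not the preprocessing code used in this lesson: the `tokenize`, `naive_stem`, and `lemmatize` helpers and the small `LEMMAS` lookup table are hypothetical. A real pipeline would typically rely on an NLP library's lemmatizer, which also takes context such as part of speech into account.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def naive_stem(token):
    """Crude stemming: strip one common suffix, even if the result is not a word."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters would remain.
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


# Hypothetical lemma table, for illustration only. A real lemmatizer uses a
# full vocabulary and the word's context instead of a fixed lookup.
LEMMAS = {
    "is": "be",
    "based": "base",
    "accounts": "account",
    "sinking": "sink",
}


def lemmatize(token):
    """Map a token to its lemma if known; otherwise return it unchanged."""
    return LEMMAS.get(token, token)


text = "It is based on accounts of the sinking of RMS Titanic in 1912."
tokens = tokenize(text)
print(tokens)
print([naive_stem(token) for token in tokens])
print([lemmatize(token) for token in tokens])
```

In this sketch, the naive stemmer turns "based" into "bas", which is not a real word, while the lemma lookup returns "base" and maps "is" to "be".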