Assessing the model with perplexity
Perplexity is a standard metric that evaluates how well a probability model predicts a sample. When applied to language models such as GPT, it is the exponentiated average negative log-likelihood of a sequence. In essence, a lower perplexity score means the model assigns higher probability to the observed tokens and is therefore better at predicting them.
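Written out, for a tokenized sequence X = (x_1, ..., x_t) and an autoregressive model with parameters θ, perplexity is commonly defined as:

\mathrm{PPL}(X) = \exp\left(-\frac{1}{t}\sum_{i=1}^{t} \log p_\theta\left(x_i \mid x_{<i}\right)\right)

Here, p_θ(x_i | x_<i) is the probability the model assigns to token x_i given all preceding tokens, and the exponentiation turns the average negative log-likelihood back into a (per-token) branching-factor-like quantity.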
The strategy employed for tokenization has a direct bearing on a model's perplexity. Tokenization is the process of dividing the input into smaller units, or tokens, and the choice of units influences the model's performance. For instance, a model that tokenizes at the word level might exhibit a higher perplexity than one that tokenizes at the subword level, because the subword model can represent a more diverse set of inputs, including rare and unseen words. Note, too, that perplexity values are only directly comparable between models that share the same tokenization, since the metric is computed per token.
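To see why subword tokenization copes better with rare words, the following short sketch (which assumes the standard gpt2 checkpoint can be downloaded) contrasts a naive word-level split with GPT-2's byte-pair-encoding tokenizer:

import torch
from transformers import AutoTokenizer

# Load GPT-2's byte-pair-encoding (BPE) tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Perplexity of antidisestablishmentarianism"

# A naive word-level split produces one token per whitespace-separated word;
# a rare word like "antidisestablishmentarianism" would be out of vocabulary
# for a fixed word-level vocabulary.
word_level = text.split()

# The subword tokenizer breaks the rare word into smaller, known pieces,
# so the model never has to emit an unknown-token placeholder.
subword_level = tokenizer.tokenize(text)

print(word_level)     # ['Perplexity', 'of', 'antidisestablishmentarianism']
print(subword_level)  # e.g. ['Per', 'plex', 'ity', 'Ġof', ...] -- the exact pieces depend on the BPE merges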
When it comes to calculating perplexity, we often run into limits on the number of tokens the model can process at once, mainly due to memory constraints. For example, GPT-2 has a fixed context length of 1,024 tokens. To work around this, a sliding-window approach can be used: the context window is repeatedly shifted along the sequence so that the model has more preceding context for each prediction. This method is a closer approximation to the true autoregressive decomposition of the sequence probability and typically yields a more favorable (lower) score than simply splitting the text into disjoint chunks.
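Before looking at the full example below, here is a minimal sketch of how the window bounds move, using assumed numbers (a 2,000-token sequence, a 1,024-token window, and a stride of 512). No model is involved; it only shows which tokens each window is responsible for scoring:

# Toy illustration of the sliding-window bounds (no model involved).
# Assumed values: a 2,000-token sequence, a 1,024-token context window,
# and a stride of 512 -- the same shape of setup as the full example below.
sequence_length = 2000
max_length = 1024
stride = 512

previous_end = 0
for start in range(0, sequence_length, stride):
    end = min(start + max_length, sequence_length)
    scored = end - previous_end  # only tokens not already scored by an earlier window
    print(f"window [{start:4d}, {end:4d})  ->  scoring the last {scored} tokens")
    previous_end = end
    if end == sequence_length:
        break

Each window overlaps the previous one, but only the tokens that have not been scored yet contribute to the loss, so every token is counted exactly once.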
Let's consider an example of how to calculate perplexity using the Hugging Face transformers library.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from tqdm import tqdm

# Setup device and model parameters
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "gpt2-large"

# Load the model and the tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load test dataset
wikitext_test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")

# Text encoding
encoded_text = tokenizer("\n\n".join(wikitext_test["text"]), return_tensors="pt")

# Model configuration parameters
max_length = model.config.n_positions
stride = 512
sequence_length = encoded_text.input_ids.size(1)

# Initialize negative log likelihoods list and previous end location
negative_log_likelihoods = []
previous_end_loc = 0

# Processing data in strides
for start_loc in tqdm(range(0, sequence_length, stride)):
    end_loc = min(start_loc + max_length, sequence_length)
    target_length = end_loc - previous_end_loc

    # Prepare input and target ids
    input_ids = encoded_text.input_ids[:, start_loc:end_loc].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-target_length] = -100

    # Calculate negative log likelihood without gradient computation
    with torch.no_grad():
        model_output = model(input_ids, labels=target_ids)
        nll = model_output.loss

    negative_log_likelihoods.append(nll)
    previous_end_loc = end_loc

    if end_loc == sequence_length:
        break

# Calculate perplexity
perplexity = torch.exp(torch.stack(negative_log_likelihoods).mean())
Lines 1–4: Import necessary libraries and modules.
Line 7: Set the device to CUDA for GPU acceleration if a GPU is available, otherwise fall back to the CPU.
Lines 10–12: Load the pre-trained GPT-2 model and its tokenizer.
Lines 14–18: Load the wikitext-2-raw-v1 dataset and tokenize the text data.
Lines 20–27: Set up variables for the maximum context length, stride, and tokenized sequence length, and initialize the list that stores the negative log-likelihoods.
Lines 29–48: Loop over the tokenized input in strides, calculate the loss for each window of tokens, and append it to the list of negative log-likelihoods.
Lines 50–51: Calculate the perplexity by exponentiating the mean of the negative log-likelihoods.
In this example, we use a stride of 512 with a context window of 1,024 tokens, meaning the model has a minimum of 512 tokens of context when calculating the conditional likelihood of any single token (provided there are at least 512 preceding tokens available for conditioning).
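To make this trade-off concrete, one option is to wrap the windowed loop in a small helper and call it with different strides. The sketch below does exactly that; perplexity_with_stride is a hypothetical helper (not a library function) and reuses model, encoded_text, device, max_length, and sequence_length from the example above:

def perplexity_with_stride(stride: int) -> float:
    """Recompute perplexity over the encoded test set for a given stride.

    Reuses model, encoded_text, device, max_length, and sequence_length
    defined in the example above.
    """
    nlls = []
    previous_end_loc = 0
    for start_loc in range(0, sequence_length, stride):
        end_loc = min(start_loc + max_length, sequence_length)
        target_length = end_loc - previous_end_loc
        input_ids = encoded_text.input_ids[:, start_loc:end_loc].to(device)
        target_ids = input_ids.clone()
        target_ids[:, :-target_length] = -100
        with torch.no_grad():
            nlls.append(model(input_ids, labels=target_ids).loss)
        previous_end_loc = end_loc
        if end_loc == sequence_length:
            break
    return torch.exp(torch.stack(nlls).mean()).item()

# A larger stride means fewer forward passes but less context per scored token,
# so the reported perplexity is typically somewhat higher (worse).
for s in (1024, 512, 256):
    print(f"stride={s:4d}  perplexity={perplexity_with_stride(s):.2f}")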
The use of metrics like perplexity to evaluate the performance of NLP models is an essential part of model development. It helps us understand a model's current performance and identify the areas that require improvement.