
Text summarization with Hugging Face transformers: Part 3

Mehwish Fatima
May 10, 2024
9 min read
Contents
Quick recap
Summarization evaluation
Evaluating summarization model
Post-processing predictions
Setting up the metrics
Inference and scoring
Putting it all together
Tips for abstractive summarization

In this blog series, we’ve already discussed the basics of text summarization, how to process text data with Hugging Face (HF), and how to design the training loop for abstractive summarization using HF. This blog focuses on how to generate outputs (inferences) from the trained model and how to evaluate them.

Quick recap

Text summarization is a high-level NLP task that takes a text input and produces a summary. It encompasses various categories based on the task definition: the input type (single-document vs. multi-document, query-focused vs. generic), the type of summarizer (extractive vs. abstractive), the required summary format (title or one-liner vs. multi-sentence), and the language of the summary (monolingual vs. cross-lingual). Summarization problems often involve a combination of these categories.

Our previous blogs discussed the basic building blocks for training an abstractive summarization model. Let’s revisit the flowchart below and then discuss how the evaluation loop works.

Flow diagram of text summarization

Suppose we have a summarization dataset, which we have divided into three parts: a training set, a development (dev, or validation) set, and a test set. The training and dev sets were used during training in our previous blog. Now, we'll generate predictions for the test set and evaluate the trained model by comparing these predictions with the reference summaries.

Summarization evaluation

To evaluate a trained model, we first need its inferences (outputs), which are then scored to determine the model’s performance. In the evaluation loop, the model is set to eval mode. The encoder receives the text to be summarized as input, and the decoder receives its contextual representations. The decoder is responsible for generating the outputs (system summaries). This raises a question: how will we evaluate the summaries? Since we have reference summaries, we can use them as a benchmark for assessing the quality of the generated summaries. We have two ways to assess quality: automatic evaluation scoring and human evaluation.

For automatic evaluation, the reference summaries are kept hidden from the model, and the outputs are compared against them to compute evaluation scores. Here, the question arises: what is an evaluation score or metric?

An automatic evaluation metric is a statistical way to quantify how well a model performs and how accurate the generated summaries are. In technical terms, it provides an approximation to some ground-truth standard, which in this case is a reference summary written by humans.

We have different evaluation metrics, such as ROUGE, BLEU, BERTScore/BARTScore, and METEOR. In summarization, the most common ones are ROUGE and BERTScore/BARTScore. These metrics require a reference summary against which the system summary is compared to produce a score.

ROUGE is an n-gram overlap measure that comes in several variants: ROUGE-1 (unigram overlap), ROUGE-2 (bigram overlap), ROUGE-L (longest common subsequence overlap), and ROUGE-S (skip-bigram overlap).

ROUGE is the standard and most commonly used metric in summarization. However, it often fails to estimate the quality of abstractive summarization models well, especially when large language models are involved. Let’s understand how n-gram overlap works with a simple example for ROUGE-1 and ROUGE-2.

[Animation: ROUGE-1 and ROUGE-2 n-gram overlap on a simple example]
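To make the idea concrete, here is a small, self-contained sketch that counts clipped unigram and bigram overlap between a toy system summary and a reference. It is a simplified illustration of ROUGE-N, not the official implementation, and the sentences are made up for the example:

from collections import Counter

def ngrams(tokens, n):
    # Count the n-grams in a list of tokens
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_rouge_n(candidate, reference, n):
    # Clipped n-gram overlap between candidate and reference, reported as F1
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    if not cand or not ref:
        return 0.0
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "the car is parked outside"
candidate = "the vehicle is parked outside"
print(simple_rouge_n(candidate, reference, 1))  # 0.8 -- "car" vs. "vehicle" is missed
print(simple_rouge_n(candidate, reference, 2))  # 0.5 -- the mismatch also breaks two bigrams

Even though the candidate and the reference mean the same thing, the scores drop because ROUGE only sees surface n-grams.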

This example shows that ROUGE cannot capture the summary quality due to its n-gram-based overlap nature. Even if the words are semantically similar (e.g., car and vehicle), n-gram overlap cannot capture them. Recently, BERT/BART scores have been introduced to overcome this challenge. These metrics calculate scores based on the embeddings, which can easily capture the semantic similarity. Let’s take a look at the previous example with a simplified version of the similarity heatmap.

[Animation: simplified similarity heatmap for the previous example]
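For illustration, here is a minimal sketch of computing BERTScore through the same metric-loading interface used later in this blog. The underlying model is downloaded on the first call, and the exact numbers depend on the model version, so treat the values as indicative only:

from datasets import load_metric

# BERTScore compares contextual embeddings, so a synonym swap
# (car -> vehicle) barely lowers the score.
bertscore = load_metric("bertscore")
scores = bertscore.compute(
    predictions=["the vehicle is parked outside"],
    references=["the car is parked outside"],
    lang="en",
)
print(scores["f1"])  # a list with one F1 value per prediction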

Automatic evaluation is often complemented by human evaluation involving human annotators.

Human evaluation is another way to evaluate the outputs, requiring multiple human judges to evaluate the summaries against the input text. The human judges (annotators) are asked to either rank the outputs or rate them on a Likert scale (a scale used to quantify opinions, attitudes, and behaviors, typically with 3–7 points ranging from "strongly agree" to "strongly disagree") based on different features, such as fluency, coherence, and relevance.

However, human evaluation is expensive in terms of time and resources, so it is usually performed on a small subset of the testing set. Now that we have understood the basic concept of how evaluation works, let’s dive deep into how to generate the outputs and calculate scores.

Evaluating summarization model

Remember that we trained the model without the Trainer class. The Trainer class provides an efficient API for feature-complete training on various tasks; we only need to pass the hyperparameters, our dataset, and the model of our choice.
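For contrast, here is a minimal sketch of what that trainer-based route could look like for summarization. The model, tokenizer, datasets, data collator, and a compute_metrics function are assumed to be defined as in the earlier parts of this series, and the argument values are placeholders rather than recommendations:

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="summarization-model",   # placeholder output directory
    per_device_eval_batch_size=8,
    predict_with_generate=True,         # use generate() during evaluation
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=dev_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,    # e.g., a wrapper around ROUGE
)

metrics = trainer.evaluate(eval_dataset=test_dataset)

Since we chose not to use the Trainer class, we must create the training and evaluation loops ourselves. Before that, let’s look at a function for post-processing the generated predictions and at setting up our evaluation metric.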

Post-processing predictions

The following function takes two list parameters: preds (generated predictions) and labels (reference summaries). It assumes that nltk has been imported and that its punkt sentence tokenizer has been downloaded (e.g., via nltk.download("punkt")).

  • In lines 2–3, we remove leading and trailing whitespace from each element in preds and labels.

  • In lines 6–7, we perform sentence tokenization with NLTK for each summary in preds and labels. We also insert newline characters (\n) between the sentences (a requirement of this ROUGE implementation) and join them back into a single string per summary.

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # rougeLSum expects a newline after each sentence
    preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(nltk.sent_tokenize(label)) for label in labels]
    return preds, labels
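As a quick usage sketch (the example strings are made up, and the exact sentence splits depend on NLTK's punkt tokenizer):

import nltk

nltk.download("punkt", quiet=True)  # sentence tokenizer used by postprocess_text

preds = ["  The car is parked outside. It is red.  "]
labels = ["The vehicle is parked outside. It is red."]

preds, labels = postprocess_text(preds, labels)
print(preds[0])
# The car is parked outside.
# It is red.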

Setting up the metrics

We load the ROUGE metric as our automatic scoring metric.

metric = load_metric("rouge")
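The load_metric function comes from the datasets library (more recent Hugging Face releases expose the same metric through the evaluate library as evaluate.load("rouge")). As a quick sanity check, this is roughly how the metric behaves on a pair of toy strings; note that compute returns aggregate scores whose mid.fmeasure field we extract later:

from datasets import load_metric

metric = load_metric("rouge")
metric.add_batch(
    predictions=["the car is parked outside"],
    references=["the vehicle is parked outside"],
)
result = metric.compute(use_stemmer=True)
print(result["rouge1"].mid.fmeasure)  # aggregate unigram F1, between 0 and 1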

Inference and scoring

Let’s discuss what we should have in the evaluation loop.

test_dataloader = DataLoader(test_dataset, collate_fn=data_collator, batch_size=args.per_device_test_batch_size)
results = {}
model.eval()
gen_kwargs = {
    "max_length": args.max_target_length if args is not None else config.max_length,
    "num_beams": args.num_beams,
}
samples_seen = 0
for step, batch in enumerate(test_dataloader):
    with torch.no_grad():
        generated_tokens = accelerator.unwrap_model(model).generate(
            batch["input_ids"], attention_mask=batch["attention_mask"], **gen_kwargs
        )
        generated_tokens = accelerator.pad_across_processes(
            generated_tokens, dim=1, pad_index=tokenizer.pad_token_id
        )
        labels = batch["labels"]
        if not args.pad_to_max_length:
            # We need to pad the labels to the same length
            labels = accelerator.pad_across_processes(batch["labels"], dim=1, pad_index=tokenizer.pad_token_id)
        generated_tokens, labels = accelerator.gather((generated_tokens, labels))
        generated_tokens = generated_tokens.cpu().numpy()
        labels = labels.cpu().numpy()
        if args.ignore_pad_token_for_loss:
            # Replace -100 in the labels as we can't decode them
            labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
        if isinstance(generated_tokens, tuple):
            generated_tokens = generated_tokens[0]
        decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
        decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
        # If we are in a multiprocess environment, the last batch has duplicates
        if accelerator.num_processes > 1:
            if step == len(test_dataloader) - 1:
                decoded_preds = decoded_preds[: len(test_dataloader.dataset) - samples_seen]
                decoded_labels = decoded_labels[: len(test_dataloader.dataset) - samples_seen]
            else:
                samples_seen += len(decoded_labels)
        metric.add_batch(predictions=decoded_preds, references=decoded_labels)
result = metric.compute(use_stemmer=True)
result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
result = {k: round(v, 4) for k, v in result.items()}
results.update(result)

Let’s review the code snippet above:

  • In lines 1–3, we set up the data loader for the test dataset, create an empty results dictionary, and switch the model to evaluation mode.

  • In lines 4–7, we define generation-specific keyword arguments, such as the maximum length of the generated text and the number of beams. The number of beams determines how many candidate sequences beam search keeps at each step of output generation.

From lines 9–39, we have the evaluation loop that iterates over batches in the test data loader, generates summaries using the trained model, and accumulates them for scoring with the defined metric.

  • In lines 11–16, we generate tokens for the input batch and pad the generated tokens across processes.

  • In lines 17–20, we handle padding for the labels if the option pad_to_max_length is not enabled.

  • In lines 21–23, we gather the generated tokens and labels across processes and move them to the CPU as NumPy arrays.

  • In lines 24–26, we replace -100 in the labels (the value we used to mask padding tokens so the loss ignores them) with the pad token ID so they can be decoded.

  • In lines 29–30, we use the batch_decode method to convert a batch of generated tokens or labels into a list of human-readable strings, decoded_preds or decoded_labels. The skip_special_tokens argument instructs the tokenizer to exclude any special tokens (e.g., padding tokens, [CLS], [SEP]) during decoding.

  • In line 31, we perform post-processing on the decoded predictions and labels with the postprocess_text function defined earlier.

  • In lines 32–38, we handle the duplicates that the last batch may contain in a multiprocess environment, truncating the predictions and labels to the true dataset size.

  • In line 39, we add the decoded predictions and labels to the metric with metric.add_batch.

  • In lines 40–43, after the loop, we compute the ROUGE scores with metric.compute(use_stemmer=True) using the accumulated predictions and references. The use_stemmer=True argument indicates that stemming should be applied during the computation.

  • In line 41, we extract the F1 score (fmeasure) for each ROUGE variant; since the raw scores lie between 0 and 1, we scale them to 100 for easier interpretation.

  • In line 42, we round the values to four decimal places (a matter of preference).

  • In line 43, we add the computed scores to the final results dictionary.

Please note that we have not stored the predictions for later use; we could do so by writing them to a .csv or .txt file.
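If you want to keep the generated summaries for error analysis, one minimal sketch is to collect them in lists inside the evaluation loop (e.g., hypothetical all_preds and all_labels lists extended with decoded_preds and decoded_labels after decoding) and write them out at the end:

import csv

# Assumes all_preds and all_labels were collected inside the evaluation loop.
with open("test_predictions.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["prediction", "reference"])
    for pred, ref in zip(all_preds, all_labels):
        writer.writerow([pred, ref])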

Putting it all together

This blog discussed how to set up a trained model for evaluation using the Hugging Face (HF) library. We’ve reached the end of the journey we started in this blog series: we’ve learned how to set up an HF model for data processing, training, and evaluation. Let’s close with some tips and hacks that could prove useful during your experiments.

Tips for abstractive summarization

If you are working on a set of experiments for research purposes or finding evidence for your project, here are some bits of advice.

  • It’s good to compile the evaluation results of large language models as the average of n runs to provide a better estimate of the model’s performance. The value of n typically ranges from 3 to 10.

  • For hyperparameter settings, it is convenient to use a config file and couple it with args. Changing values in a config file is easier, and you can launch a new set of experiments with just a few changes (see the sketch after this list).

  • For the hyperparameters of your summarization model, start with the default values and then use a grid search or Weights & Biases (wandb) visualizations to better understand how the model performs on your data or task. This will help you find a good parameter combination for your experiments.

  • Using the Trainer class is a good choice, as it keeps your code optimized and easy to read, and you do not have to worry about debugging and verifying your training and evaluation loops. In this blog, we intentionally wrote the loops ourselves to provide a deeper understanding of how summarization works. Once you are comfortable with the concepts, you can take advantage of the conveniences Hugging Face offers.

  • It is always good to run experiments with a couple of different models. Remember, no model is one-size-fits-all; you have to find what works best in your situation.

  • Last but not least, when analyzing and finalizing your experiment results, it is good to support them with statistical evidence. This could include statistical significance testing to determine whether one model is really better than the others. Human evaluation is strong supporting evidence if it comes out in your favor, so plan it in advance.
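To illustrate the config-plus-arguments pattern from the second tip above, here is a minimal sketch; the file name and keys are hypothetical, and you would adapt them to your own experiment setup:

import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument("--config", default="experiment_config.json")  # hypothetical config file
parser.add_argument("--num_beams", type=int)
parser.add_argument("--max_target_length", type=int)
cli_args = parser.parse_args()

with open(cli_args.config) as f:
    config = json.load(f)  # e.g., {"num_beams": 4, "max_target_length": 128}

# Command-line values take precedence over the config file defaults.
for key, value in vars(cli_args).items():
    if key != "config" and value is not None:
        config[key] = value

print(config)

With this setup, a new experiment is just a new config file (or a one-flag override on the command line).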

For a deeper understanding of NLP techniques and the use of Hugging Face transformers in practical scenarios, explore our comprehensive courses tailored to enhance your skills in these cutting-edge technologies:

Applying Hugging Face Machine Learning Pipelines in Python

Hugging Face is a community-driven effort to develop and promote artificial intelligence for a wide array of applications. The organization’s pre-trained, state-of-the-art deep learning models can be deployed to various machine learning tasks. In this course, you’ll explore the Hugging Face artificial intelligence library with particular attention to natural language processing (NLP) and computer vision. You’ll first explore Hugging Face’s approach to deep learning with specific attention to transformers. You’ll then learn Hugging Face’s pipeline API model and apply various pipelines to unique NLP tasks such as classification, summarization, question answering, and more. You’ll continue with a new set of Hugging Face pipelines for computer vision tasks including object detection and segmentation. By the end of this course, you’ll be familiar with a wide array of Hugging Face’s pipelines for common machine learning tasks and their implementation in Python using PyTorch.

40 mins · Intermediate · 10 Playgrounds · 2 Quizzes

Building Advanced Deep Learning and NLP Projects

In this course, you'll not only learn advanced deep learning concepts, but you'll also practice building some advanced deep learning and Natural Language Processing (NLP) projects. By the end, you will be able to utilize deep learning algorithms that are used at large in industry. This is a project-based course with 12 projects in total. This will get you used to building real-world applications that are being used in a wide range of industries. You will be exposed to the most common tools used for machine learning projects including: NumPy, Matplotlib, scikit-learn, TensorFlow, and more. It’s recommended that you have a firm grasp of these topic areas: Python basics, NumPy and Pandas, and Artificial Neural Networks. Once you’re finished, you will have the experience to start building your own amazing projects, and some great new additions to your portfolio.

5 hrs · Intermediate · 53 Playgrounds · 10 Quizzes