
Text summarization with Hugging Face transformers: Part 1

Mehwish Fatima
Nov 27, 2023
10 min read


In this series of blogs, we will first discuss the basics of text summarization, a high-level natural language processing (NLP) task. We will then explore how abstractive summarization works and how to implement it with Hugging Face transformers. Lastly, we will find out how we can assess the summarization models. This blog will focus on the basics and data processing for summarization.

What is summarization?#

In today's digitally connected world, textual data is growing at an enormous rate, and it is becoming hard to keep up with it. For example, an online product we want to buy may have thousands of reviews, far too many to read. A tool that sums up all those reviews into one short summary would make our lives much easier. Similarly, consider an investigative journalist who wants to collect information about a specific event from various sources. What if a tool could generate a timeline summary of that event from previous news reports and other sources? This is where the NLP task called text summarization comes in. Text summarization can be divided into different categories depending on the nature of the task, as shown in the illustration below. Some important questions to ask when defining the task are:

  • What kind of input is given: a single document vs. multiple documents, and a query-focused vs. a generic summary.

  • What type of summarizer we want: one that selects sentences from the original input (extractive summarization) vs. one that generates a coherent, human-like summary of the salient information (abstractive summarization).

  • What kind of summary is required: a title or one-liner summary (extreme summarization) vs. a multi-sentence, abstract-like summary.

  • In what language the summary is required: in the source language (monolingual summarization) vs. in another target language (cross-lingual summarization).

Classification of text summarization

A summarization problem can be a mix and match of these categories. So, what is the formal definition of text summarization, and what are the properties a good summary should have? The former part of the question is easy to answer, and the latter is trickier. 

By definition, text summarization is a high-level NLP task that takes a text as an input and produces its summary as an output. A summary should contain salient information about the given text. In terms of properties, a good summary should be at least fluent, well-structured, and coherent. Depending on the nature of the task, there can be some additional properties. However, measuring these properties, such as coherence and fluency, is not a straightforward task and requires human effort.

Summarization example#

Let's look at a black-box example of summarization with transformers, where we provide an input text to a summarizer and it generates the output summary. The video below shows how we can execute transformers on Educative's platform. This example intentionally covers a simplified version of summarization where we only provide the input and get the output.

[Video: running a summarization transformer on Educative's platform]
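If you want to try the same black-box behavior in code, here is a minimal sketch using the Hugging Face pipeline API. The model name facebook/bart-large-cnn is just one possible choice of pretrained summarization model, and the length limits are illustrative values:

from transformers import pipeline

# Load a summarization pipeline with a pretrained model (one possible choice)
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

text = "Replace this with the long input text that we want to condense ..."

# Generate a short summary; the length limits are illustrative values
result = summarizer(text, max_length=60, min_length=10, do_sample=False)
print(result[0]["summary_text"])

A black-box summarization example with the pipeline API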

Let’s understand how a summarization model can be trained, tested, and evaluated on a given dataset.

How does text summarization work?#

We need some building blocks for training an abstractive summarization model. Let’s check the flowchart below, and then we’ll discuss these building blocks.

Flow diagram of text summarization
Flow diagram of text summarization

Firstly, we need a summarization dataset where each instance consists of a text-summary pair. We feed these instances, together with a sequence-to-sequence (S2S) model, into the training loop, which is responsible for training the model. The model is trained by showing it examples of inputs and expected outputs (reference summaries). This kind of training is called supervised learning. The data given to the training loop is called the training set (for now, let's ignore the dev set). The trained model can then produce outputs for given texts. Suppose we saved a chunk of the dataset that wasn't used during training, called the test set. We provide the trained model and the test set's input texts to the testing loop, which generates output summaries for all the given inputs. This step is also called inference.

Here, a question arises: how do we know that the generated summaries are accurate? To confirm this, we need a metric that can assess the outputs. This is called the automatic evaluation of a model. We provide the output summaries and their reference summaries (which we did not use during testing) to the evaluation loop, which assesses the summaries by comparing them. The outcome of the evaluation is a set of evaluation scores, which lets us measure the quality of the model output. We can also give the outputs to human annotators to assess their quality, which is called human evaluation. Either way, the outcome of evaluation is a set of scores that indicate how well or poorly a model has been trained.
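As a concrete illustration of automatic evaluation, summaries are often scored with ROUGE, which measures the n-gram overlap between generated and reference summaries. Here is a minimal sketch using the Hugging Face evaluate library (the example texts are made up):

import evaluate

# Load the ROUGE metric (requires the evaluate and rouge_score packages)
rouge = evaluate.load("rouge")

predictions = ["the cat sat on the mat"]       # made-up model outputs
references = ["a cat was sitting on the mat"]  # made-up reference summaries

# Returns ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum scores
print(rouge.compute(predictions=predictions, references=references))

Scoring generated summaries against references with ROUGE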

How to implement summarization#

Enough theoretical discussion! Let's move on to the implementation of text summarization. Summarization is a hot topic, and almost all big tech companies have developed libraries and tools for it. However, this blog series focuses on the Hugging Face transformers implementation of abstractive summarization.

Here comes a question: What is Hugging Face (HF)? Think of HF as an umbrella platform for the AI community, hosting open-source datasets, models, pipelines, and evaluation metrics, along with community discussions. An AI developer can find almost everything they need on Hugging Face.

Summarization with Hugging Face#

For abstractive summarization with HF, we need a dataset, a pipeline or a pretrained model for training and inference, and evaluation metrics for summarization. Luckily, HF already provides all of these pieces. Let's understand how they work one by one. We will start with data processing for summarization, which is the focus of the rest of this blog.

Data#

First, we need a summarization dataset in which each instance consists of a text paired with a reference summary (or summaries). We split the data into train, development (dev), and test sets. We train a summarization model with the train and dev sets, while the test set remains unseen by the trained model so that we can measure its performance. The split ratio can vary according to the domain and problem; the most common ratios are 80/10/10 or 90/5/5 for train/dev/test, respectively.
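If a dataset doesn't come with predefined splits, we can create them ourselves. Below is a minimal sketch using the datasets library, assuming dataset is an already loaded Dataset object without splits; the 90/5/5 ratio is one of the common choices mentioned above:

# Assume `dataset` is a loaded HF Dataset without predefined splits
splits = dataset.train_test_split(test_size=0.1, seed=42)           # 90% train, 10% held out
dev_test = splits["test"].train_test_split(test_size=0.5, seed=42)  # split the held-out 10% in half

train_set = splits["train"]
dev_set = dev_test["train"]
test_set = dev_test["test"]

Creating a 90/5/5 train/dev/test split with the datasets library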

Datasets#

HF hosts a variety of summarization datasets, ranging from news articles to long scientific papers, and from monolingual datasets in many languages to cross-lingual datasets. The code snippet below shows how to load an existing HF dataset.

from datasets import load_dataset

# Load the CNN/Daily Mail summarization dataset (version 3.0.0)
dataset = load_dataset("cnn_dailymail", "3.0.0")
Loading an HF dataset

Other examples of summarization datasets are CNN/Daily Mail, XSum, Multi-News, Amazon Reviews Multi, and arXiv. Some datasets come with separate files for each split; otherwise, the split can be done in code. We can also use custom datasets with HF models, as shown in the sketch below.
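For custom data, the same load_dataset function can read local files. Here is a small sketch, assuming our text-summary pairs live in hypothetical CSV files with text and summary columns:

from datasets import load_dataset

# File names and column layout are hypothetical; adjust them to your own data
dataset = load_dataset(
    "csv",
    data_files={"train": "train.csv", "validation": "dev.csv", "test": "test.csv"},
)
print(dataset["train"][0])  # e.g., {"text": "...", "summary": "..."}

Loading a custom CSV dataset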

Data processing#

Now that we have our dataset, the next step is to process it before passing it to the model for training. For this processing, we need a tokenizer and a data collator. The tokenizer is responsible for tokenizing the data and maintaining a vocabulary. These days, byte pair encoding (BPE) and other subword tokenization techniques are popular, as they reduce the vocabulary size effectively.
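To get a feel for subword tokenization, here is a small sketch; we will introduce AutoTokenizer properly in the next section, and the exact subword pieces depend on the tokenizer's learned vocabulary, so the output in the comment is only indicative:

from transformers import AutoTokenizer

# BART uses a byte-level BPE (subword) tokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

# A long or rare word is split into smaller, more frequent subword pieces
print(tokenizer.tokenize("uncharacteristically"))
# e.g., something like ['unch', 'aracter', 'istically'], depending on the vocabulary

Subword (BPE) tokenization in action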

Tokenizer#

There are plenty of different tokenizers available on HF. However, it is important to use the same tokenizer as the model: for example, if we want to use the pretrained model BART, the data must be tokenized with the BART tokenizer. In many cases, we want our code to be flexible so that we can reuse it with various models. The good news is that HF provides this flexibility with Auto Classes. In the code snippet below, AutoTokenizer loads, via from_pretrained, the tokenizer that matches the model of our choice.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
    cache_dir=model_args.cache_dir,
    use_fast=model_args.use_fast_tokenizer,
    revision=model_args.model_revision,
    use_auth_token=True if model_args.use_auth_token else None,
)

Don't worry about model_args; these are parameters passed with the execution command and collected in helper dataclasses (HF's example scripts parse them with HfArgumentParser). The tokenizer_name parameter selects a specific tokenizer if it differs from the model, cache_dir is the folder to use if we want to change where HF caches downloaded files, and use_fast_tokenizer selects the speedy, Rust-based tokenizer implementation. The model_revision parameter selects a specific version of a model. The use_auth_token parameter passes a bearer token for accessing private or gated files on the HF Hub.
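As an illustration of where model_args can come from, here is a minimal, hypothetical sketch modeled on HF's example scripts; the ModelArguments dataclass and its default values are assumptions, not part of this blog's code:

from dataclasses import dataclass, field
from typing import Optional
from transformers import HfArgumentParser

@dataclass
class ModelArguments:
    # Hypothetical argument container, modeled on HF's example scripts
    model_name_or_path: str = field(default="facebook/bart-large-cnn")
    tokenizer_name: Optional[str] = field(default=None)
    cache_dir: Optional[str] = field(default=None)
    use_fast_tokenizer: bool = field(default=True)
    model_revision: str = field(default="main")
    use_auth_token: bool = field(default=False)

# Parses command-line flags such as --model_name_or_path into the dataclass
parser = HfArgumentParser(ModelArguments)
model_args, = parser.parse_args_into_dataclasses()

A hypothetical ModelArguments dataclass parsed with HfArgumentParser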

Remember: Tokenizers are responsible for tokenization, truncation, padding of data, and adding special tokens. Tokenizers are also responsible for encoding (text-to-vector) and decoding (vector-to-text).
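Here is a quick sketch of what encoding and decoding look like with the tokenizer we just initialized (the example sentence is made up, and the exact token IDs depend on the tokenizer):

# Encode: text -> token IDs (integers), truncating to an illustrative maximum length
encoded = tokenizer("Transformers make summarization easy.", truncation=True, max_length=32)
print(encoded["input_ids"])

# Decode: token IDs -> text, dropping special tokens such as the start/end markers
print(tokenizer.decode(encoded["input_ids"], skip_special_tokens=True))

Encoding and decoding text with the tokenizer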

If we are working with multilingual or cross-lingual data, we have to set the source and target languages. We also have to set forced_bos_token_id for decoding.

if isinstance(tokenizer, tuple(MULTILINGUAL_TOKENIZERS)):
    assert data_args.lang is not None, (
        f"{tokenizer.__class__.__name__} is a multilingual tokenizer which requires the --lang argument"
    )
    tokenizer.src_lang = data_args.lang
    tokenizer.tgt_lang = data_args.tgt_lang

# For multilingual translation models like mBART-50 and M2M100 we need to force the target language token
# as the first generated token. We ask the user to explicitly provide this as the --forced_bos_token argument.
forced_bos_token_id = (
    tokenizer.lang_code_to_id[data_args.forced_bos_token]
    if data_args.forced_bos_token is not None
    else None
)
model.config.forced_bos_token_id = forced_bos_token_id

Now we have initialized our tokenizer, but we haven't applied it to our data yet. For this, we create a function that applies the tokenizer to the text (input) and the reference summary (target) of each set (train, dev, and test).

def preprocess_function(examples):
    inputs, targets = [], []
    for i in range(len(examples[text_column])):
        if examples[text_column][i] is not None and examples[summary_column][i] is not None:
            inputs.append(examples[text_column][i])
            targets.append(examples[summary_column][i])

    inputs = [prefix + inp for inp in inputs]
    model_inputs = tokenizer(inputs, max_length=data_args.max_source_length, padding=padding, truncation=True)

    # Set up the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length" and data_args.ignore_pad_token_for_loss:
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

Inside preprocess_function, we first add prefix at the start of each input text. For some pretrained models (for example, T5), adding the task name as a prefix is required because these models are trained on multiple NLP tasks. Then, we apply the tokenizer to our inputs: max_length=data_args.max_source_length sets the maximum accepted token length of the input, padding=padding controls whether shorter texts are padded up to that length, and truncation=True truncates any text that is longer than the maximum length. The with tokenizer.as_target_tokenizer() block switches the tokenizer to the decoding (target) side, and we then tokenize our summaries, with max_length=max_target_length setting the maximum length of the target. Finally, because we don't want the padding token to contribute to the loss during model optimization, the last if block replaces every padding token ID in the labels with -100, which is ignored in the loss calculation.
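To actually run preprocess_function over a dataset split, we can use the map method from the datasets library. Here is a minimal sketch, where raw_datasets and the removal of the original columns are assumptions based on how HF's example scripts typically do this:

# Apply the preprocessing to the whole training split in batches
train_dataset = raw_datasets["train"].map(
    preprocess_function,
    batched=True,                                       # pass batches of examples to the function
    remove_columns=raw_datasets["train"].column_names,  # keep only the tokenized fields
    desc="Tokenizing the training set",
)

Applying the preprocessing function with map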

Data collator#

Now we have our tokenizer all set; however, we can't process all of the data at once due to resource limitations. We need batch processing, so the data is handled in chunks over multiple iterations. Here, a DataCollator comes to our help: it assembles individual examples into batches loaded into memory and dynamically pads them so that all sequences in a batch have the same length. The code snippet below shows a data collator for summarization.

from transformers import DataCollatorForSeq2Seq

# Data collator
label_pad_token_id = -100 if data_args.ignore_pad_token_for_loss else tokenizer.pad_token_id
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8 if training_args.fp16 else None,
)

The DataCollatorForSeq2Seq constructor (for the time being, just ignore the Seq2Seq part) accepts the selected tokenizer and model, along with label_pad_token_id, which marks the padding positions in the labels so that they are ignored during loss calculation. It also accepts pad_to_multiple_of, which pads sequences up to a multiple of the given value (a multiple of 8 is more efficient on modern GPUs when training with fp16).

Batch size depends on many factors—the length of input and output text, size of the loaded model and tokenizer, and specifications of hardware resources (GPU memory).
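To see the collator in action, here is a minimal sketch that builds batches from the tokenized training set with PyTorch's DataLoader. In practice, the collator is usually just handed to the HF trainer (covered in the next blog), so this standalone loop is only for illustration, and it assumes train_dataset contains only the tokenized columns:

from torch.utils.data import DataLoader

# The collator pads each sampled batch on the fly to the longest sequence in that batch
train_dataloader = DataLoader(
    train_dataset,
    batch_size=8,          # illustrative value; tune it to your GPU memory
    shuffle=True,          # shuffling is done by the DataLoader, not the collator
    collate_fn=data_collator,
)

batch = next(iter(train_dataloader))
print(batch["input_ids"].shape, batch["labels"].shape)

Batching the tokenized data with the data collator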

What’s next? #

So far, we have covered the basics of text summarization, its applications, what HF is, and how to load and process data for a summarization model. In the next blog, we will see which models we need for summarization and how to train and test them; after that, we will explore how to evaluate the outputs. See you in the next blog!

Interested in learning more about Hugging Face and NLP? Check out the following courses:

Applying Hugging Face Machine Learning Pipelines in Python


Hugging Face is a community-driven effort to develop and promote artificial intelligence for a wide array of applications. The organization’s pre-trained, state-of-the-art deep learning models can be deployed to various machine learning tasks. In this course, you’ll explore the Hugging Face artificial intelligence library with particular attention to natural language processing (NLP) and computer vision. You’ll first explore Hugging Face’s approach to deep learning with specific attention to transformers. You’ll then learn Hugging Face’s pipeline API model and apply various pipelines to unique NLP tasks such as classification, summarization, question answering, and more. You’ll continue with a new set of Hugging Face pipelines for computer vision tasks including object detection and segmentation. By the end of this course, you’ll be familiar with a wide array of Hugging Face’s pipelines for common machine learning tasks and their implementation in Python using PyTorch.

40mins
Intermediate
10 Playgrounds
2 Quizzes

Building Advanced Deep Learning and NLP Projects


In this course, you'll not only learn advanced deep learning concepts, but you'll also practice building some advanced deep learning and Natural Language Processing (NLP) projects. By the end, you will be able to utilize deep learning algorithms that are used at large in industry. This is a project-based course with 12 projects in total. This will get you used to building real-world applications that are being used in a wide range of industries. You will be exposed to the most common tools used for machine learning projects including: NumPy, Matplotlib, scikit-learn, TensorFlow, and more. It’s recommended that you have a firm grasp of these topics: Python basics, NumPy and Pandas, and Artificial Neural Networks. Once you’re finished, you will have the experience to start building your own amazing projects, and some great new additions to your portfolio.

5hrs
Intermediate
53 Playgrounds
10 Quizzes

  
