
Text summarization with Hugging Face Transformers: Part 2

Mehwish Fatima
12 min read


In this blog series, we have already covered the basics of text summarization, how abstractive summarization works, and how to process data with Hugging Face (HF). This blog focuses on how to design the training loop for abstractive summarization with HF, and on how efficiently HF Transformers handle the data processing and training aspects of the task.

Quick recap#

Text summarization is a high-level NLP task that takes a text input and produces a summary. It encompasses various categories, depending on the task definition: the input type (single-document vs. multi-document, query-focused vs. generic), the type of summarizer (extractive vs. abstractive), the required summary format (title or one-liner vs. multi-sentence), and the language of the summary (monolingual vs. cross-lingual). Summarization problems often combine several of these categories.

Our previous blog discussed the basic building blocks for training an abstractive summarization model. Let’s revisit the flowchart below, and then discuss how the training loop works.

Flow diagram of text summarization

Suppose we have a summarization dataset that we have divided into three parts: training, dev, and test. The training and dev sets are used during training to train and evaluate the sequence-to-sequence (S2S) model before the actual testing.
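As a concrete illustration, here is one way such a split could be obtained with the datasets library. This is a minimal sketch with placeholder choices: the dataset name, split sizes, and seed are assumptions made for the example, not part of the original setup.

from datasets import load_dataset

# Load a summarization dataset; CNN/DailyMail is just an illustrative choice.
raw = load_dataset("cnn_dailymail", "3.0.0")

# This dataset already ships with train/validation/test splits.
train_dataset = raw["train"]
eval_dataset = raw["validation"]   # the "dev" set used during training
test_dataset = raw["test"]         # held out until the final evaluation

# For a dataset with a single split, we could carve out dev and test sets ourselves:
# tmp = raw["train"].train_test_split(test_size=0.2, seed=42)
# dev_and_test = tmp["test"].train_test_split(test_size=0.5, seed=42)
# train_dataset, eval_dataset, test_dataset = tmp["train"], dev_and_test["train"], dev_and_test["test"]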

An S2S model consists of two neural networks: an encoder and a decoder. The encoder accepts an input sequence and transforms it into contextual representations. The decoder accepts these contextual representations and a target sequence (during training only), and it generates the output sequence.

In the training loop, the encoder receives the text to be summarized as input, and the decoder receives its contextual representations and the reference summary as a target. The decoder is responsible for generating the output (system summary). The training dataset is consumed in batches, each with a specific number of instances, and the model's internal weights are updated after each processed batch. One complete pass over all batches of the training set is called an epoch.

The number of epochs is a hyperparameter that defines how many passes over the training data are performed to train or fine-tune a model.

However, the training loop does not end here. Rather, it switches the model into evaluation mode to process the dev set. Think of the dev set as a stand-in test set used during training to monitor how well the optimization (gradient descent) is progressing and to calibrate hyperparameters.

Implementing summarization#

Recall from the previous blog that, when using HF, we can use either a pipeline or a pretrained model for text summarization. In this blog, we will discuss fine-tuning a pretrained abstractive summarization model. Interestingly, HF provides two options: with or without the Trainer class. The Trainer class provides an efficient API for feature-complete training on various tasks; we only need to pass the hyperparameters, our dataset, and the model of our choice. However, if we opt out of using the Trainer class (which is our case), we must write the training loop ourselves. For contrast, a minimal sketch of the Trainer-based route is shown below; the rest of this blog walks through the key points of a hand-written training loop.
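The sketch below shows roughly what the Trainer-based alternative could look like. The hyperparameter values are illustrative placeholders, and train_dataset and eval_dataset are assumed to be the tokenized splits prepared in the previous blog.

from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

training_args = Seq2SeqTrainingArguments(
    output_dir="bart-summarization",   # where checkpoints and logs are written
    num_train_epochs=3,                # illustrative values only
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    learning_rate=5e-5,
    predict_with_generate=True,        # generate summaries during evaluation
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,       # tokenized splits from the data-processing step
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()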

Models#

HF hosts a variety of summarization models, such as BERT (for extractive summarization), BART, T5, Pegasus, ProphetNet, BigBird, and so on. Some are trained for multiple tasks and on various and/or multilingual datasets. Depending on the parameter count and model size, different variants of each model are available on HF. Some examples of BART are listed below:

  • facebook/bart-base (https://huggingface.co/facebook/bart-base)

  • facebook/bart-large-cnn (https://huggingface.co/facebook/bart-large-cnn)

  • Shahm/bart-german (https://huggingface.co/Shahm/bart-german)

  • facebook/mbart-large-50-many-to-many-mmt (https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt)

  • eugenesiow/bart-paraphrase (https://huggingface.co/eugenesiow/bart-paraphrase)

We can set our model in the code, or it can be provided via model_args. The code snippet below shows how to load a pretrained model for fine-tuning.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and the pretrained checkpoint we want to fine-tune
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

The data must be processed with the tokenizer that matches the selected model type. For example, if we select BART as our summarization model, then we need to use the BART tokenizer.
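To make this concrete, here is a minimal, hypothetical example of tokenizing an article and its reference summary with the tokenizer loaded above. The example text and length limits are placeholders, and the text_target argument assumes a reasonably recent transformers release.

article = "The city council approved the new transit plan on Tuesday after months of debate."
summary = "City council approves transit plan."

# Tokenize the input text and the target summary; the length limits are illustrative.
model_inputs = tokenizer(article, max_length=1024, truncation=True)
labels = tokenizer(text_target=summary, max_length=128, truncation=True)
model_inputs["labels"] = labels["input_ids"]

print(model_inputs.keys())  # input_ids, attention_mask, labels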

Training#

In the previous blog, we saw how to process the data for the summarization task. Now, let’s discuss how it can be loaded for training.

from torch.utils.data import DataLoader

train_dataloader = DataLoader(train_dataset, shuffle=True, collate_fn=data_collator, batch_size=args.per_device_train_batch_size)
eval_dataloader = DataLoader(eval_dataset, collate_fn=data_collator, batch_size=args.per_device_eval_batch_size)

DataLoader is a PyTorch class for optimized and efficient batched data loading. At this stage, data instances are converted into tensors. For the training set, we usually enable shuffling so that instances are presented in a different order on each run, maintaining the randomness of experiments. Now, our train and dev sets are ready to be processed in training.

Don't confuse DataLoader with DataCollator: the DataLoader iterates over the dataset and yields batches of tensors, while the DataCollator shapes each batch before it is returned (padding to a common length, assembling the batch tensors, and so on).

The DataLoader class also helps in the parallel processing of data instances.
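For reference, here is a brief sketch of how the collator and loaders could be wired together. It assumes the tokenizer, model, and tokenized train_dataset/eval_dataset from earlier, and the batch sizes are placeholder values.

from torch.utils.data import DataLoader
from transformers import DataCollatorForSeq2Seq

# The collator pads input_ids, attention_mask, and labels to the longest
# sequence in each batch and returns PyTorch tensors.
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, label_pad_token_id=-100)

train_dataloader = DataLoader(train_dataset, shuffle=True, collate_fn=data_collator, batch_size=8)
eval_dataloader = DataLoader(eval_dataset, collate_fn=data_collator, batch_size=8)

batch = next(iter(train_dataloader))
print(batch["input_ids"].shape)  # (batch_size, longest_sequence_in_batch)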

Setting up an optimizer and learning rate scheduler#

For training neural networks, we require an optimizer that adjusts the model's parameters during training to minimize the loss. Optimization algorithms, such as gradient descent, stochastic gradient descent, Adam, and Adafactor, enable the model to learn from data by iteratively updating its weights and biases. The exact update rules, learning rate handling, and momentum depend on the chosen optimization algorithm.

The optimizer helps improve the model's performance; however, the choice of optimizer can strongly affect both the accuracy and the training speed of the model.

It is important to note that weights and biases are learnable parameters of the model, while the learning rate is a hyperparameter we provide initially. The learning rate is then adapted with the help of the learning rate scheduler (lr_scheduler), which adjusts the rate used by the optimizer to improve performance and reduce training time. We set our optimizer (AdamW) and lr_scheduler in the code snippet below, and we use the length of train_dataloader together with the args.gradient_accumulation_steps hyperparameter to compute num_update_steps_per_epoch.

# Setting optimizer
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
        "weight_decay": args.weight_decay,
    },
    {
        "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
]
optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate)

# Setting scheduler with reference to the number of training steps.
num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
if args.max_train_steps is None:
    args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
else:
    args.num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch)

lr_scheduler = get_scheduler(
    name=args.lr_scheduler_type,
    optimizer=optimizer,
    num_warmup_steps=args.num_warmup_steps,
    num_training_steps=args.max_train_steps,
)

Now let’s review the above code:

  • In lines 12–16, args.max_train_steps is used to specify the maximum number of training steps, i.e., the total number of optimization steps we want to perform during training (see the short numeric sketch after this list).

    • If args.max_train_steps is not provided (None), it is calculated as the product of args.num_train_epochs and num_update_steps_per_epoch. In other words, it determines the maximum training steps based on the number of epochs and updates per epoch.

    • If args.max_train_steps is already set, it calculates the number of training epochs required to reach this maximum number of steps.

  • In lines 18–21, we initialize a learning rate scheduler of our choice—in this case, args.lr_scheduler_type. We also set num_warmup_steps and num_training_steps to configure the learning rate warm-up and total training steps for the scheduler.
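As a quick sanity check on these formulas, here is a small, purely illustrative calculation with made-up values.

import math

# Hypothetical values, only to illustrate the arithmetic described above.
len_train_dataloader = 1000        # batches per epoch
gradient_accumulation_steps = 4
num_train_epochs = 3

num_update_steps_per_epoch = math.ceil(len_train_dataloader / gradient_accumulation_steps)  # 250
max_train_steps = num_train_epochs * num_update_steps_per_epoch                             # 750

# If max_train_steps were instead fixed to, say, 500, the epoch count would be derived from it:
num_train_epochs_from_steps = math.ceil(500 / num_update_steps_per_epoch)                   # 2
print(num_update_steps_per_epoch, max_train_steps, num_train_epochs_from_steps)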

Setting up evaluation criteria#

Remember earlier when we discussed optimizing our model during training? The optimization target is to minimize the loss. Different loss functions (cross-entropy, mean squared error, mean absolute error, KL divergence, and so on) can be used for optimization; the most common one for summarization is cross-entropy. However, it is also convenient to add a summarization metric to get insights into training. We'll use ROUGE, the standard evaluation metric for the summarization task, to examine the behavior of our model at each epoch.

from datasets import load_metric

metric = load_metric("rouge")

ROUGE (R) is an n-gram-based metric that evaluates n-gram overlap between the system output and the reference summary. R-1 (unigram), R-2 (bigram), and R-L (longest common subsequence) are the most commonly reported variants.
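As a toy illustration with made-up sentences, the metric can be fed batches of predictions and references and then aggregated. The .mid.fmeasure access below matches the older datasets implementation of ROUGE that this blog's snippets rely on.

from datasets import load_metric

metric = load_metric("rouge")

predictions = ["the cat sat on the mat"]
references = ["the cat lay on the mat"]

metric.add_batch(predictions=predictions, references=references)
result = metric.compute(use_stemmer=True)

# Each entry is an aggregate score; we usually report the mid F-measure.
print({k: round(v.mid.fmeasure * 100, 2) for k, v in result.items()})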

Setting up logs and checkpoints#

Now, we are ready to set up our training logs and checkpoints. We need to log some information to check if everything is working.

Let’s review the code snippet below:

  • In line 2, total_batch_size is computed.

  • In lines 3–9, we log important information about our training. The logger object writes all of this information to a text file, which we can inspect while training runs to ensure that everything is working correctly.

total_batch_size = args.per_device_train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps

logger.info("***** Running training *****")
logger.info(f"  Num examples = {len(train_dataset)}")
logger.info(f"  Num Epochs = {args.num_train_epochs}")
logger.info(f"  Instantaneous batch size per device = {args.per_device_train_batch_size}")
logger.info(f"  Total train batch size (w. parallel, distributed & accumulation) = {total_batch_size}")
logger.info(f"  Gradient Accumulation steps = {args.gradient_accumulation_steps}")
logger.info(f"  Total optimization steps = {args.max_train_steps}")

# If resuming from a checkpoint, we load weights and states.
if args.resume_from_checkpoint:
    if args.resume_from_checkpoint is not None or args.resume_from_checkpoint != "":
        accelerator.print(f"Resumed from checkpoint: {args.resume_from_checkpoint}")
        accelerator.load_state(args.resume_from_checkpoint)
        resume_step = None
        path = args.resume_from_checkpoint
    else:
        # Get the most recent checkpoint
        dirs = [f.name for f in os.scandir(os.getcwd()) if f.is_dir()]
        dirs.sort(key=os.path.getctime)
        path = dirs[-1]  # Sorts folders by date modified
    if "epoch" in path:
        args.num_train_epochs -= int(path.replace("epoch_", ""))
    else:
        resume_step = int(path.replace("step_", ""))
        args.num_train_epochs -= resume_step // len(train_dataloader)
        resume_step = (args.num_train_epochs * len(train_dataloader)) - resume_step

The checkpoints are used to resume training in case it is interrupted (power failure, GPU timeout, CUDA out-of-memory errors, and so on).

  • In lines 12–23, the resume_from_checkpoint variable decides whether to resume the training or not. If it resumes, it also determines which checkpoint to load.

  • In lines 25–31, we check whether the given checkpoint is an epoch or a step (we can set a hyperparameter for how we want to save our checkpoints).

Training loop#

Now, we will set up our training loop with the total number of training epochs.

for epoch in range(args.num_train_epochs):
    model.train()
    if args.with_tracking:
        total_loss = 0
    for step, batch in enumerate(train_dataloader):
        # Skip the steps already completed if we resumed from a checkpoint
        if args.resume_from_checkpoint and epoch == 0 and step < resume_step:
            continue
        outputs = model(**batch)
        loss = outputs.loss
        # Keep track of the loss at each epoch
        if args.with_tracking:
            total_loss += loss.detach().float()
        loss = loss / args.gradient_accumulation_steps
        accelerator.backward(loss)
        if step % args.gradient_accumulation_steps == 0 or step == len(train_dataloader) - 1:
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            # progress_bar and completed_steps are initialized before this loop (omitted here)
            progress_bar.update(1)
            completed_steps += 1
        if isinstance(checkpointing_steps, int):
            if completed_steps % checkpointing_steps == 0:
                output_dir = f"step_{completed_steps}"
                if args.output_dir is not None:
                    output_dir = os.path.join(args.output_dir, output_dir)
                accelerator.save_state(output_dir)
        if completed_steps >= args.max_train_steps:
            break
    # Model evaluation steps are excluded here to keep the snippet easy to read.

Let’s review the code snippet above:

  • In line 2, we set the model to training mode, which is necessary to enable features like dropout and batch normalization that behave differently during training and evaluation.

  • In lines 3–4, if the args.with_tracking flag is enabled, we initialize the total_loss variable to zero to keep track of the total loss during the current epoch.

  • From lines 6–34, we have an inner loop for batch processing. This loop iterates over train_dataloader to process a batch of data at a time.

    • In lines 7–8, if the training process is resuming from a checkpoint—as specified with args.resume_from_checkpoint—it skips steps until it reaches the step where training was paused. This is to avoid reprocessing data already processed before the interruption.

    • In lines 10–11, we compute a forward pass of the model with the current batch of data and calculate the loss. The loss is typically a measure of how well the model’s predictions match the actual target values.

    • In lines 14–15, if tracking is enabled, the loss from the current batch is added to the total_loss. The .detach().float() part ensures that the loss is treated as a float and detached from the computation graph.

    • In lines 16–17, we compute the scaled loss by dividing the loss by args.gradient_accumulation_steps. This helps mimic training with bigger batch sizes, especially when we quickly run out of CUDA memory. We backpropagate the scaled loss with accelerator.backward(), which accumulates the gradients. For example, assume we want to train our model with a batch size of 32, but our input and output sizes prevent us from fitting it in memory. We can instead run 32 iterations with a batch size of 1, scale each loss by 1/32, accumulate the gradients, and only then take an optimizer step, which is equivalent to training with a batch size of 32 (see the short sketch after this list).

    • In lines 19–24, an optimization step is taken—weights are updated—if either the current step is a multiple of args.gradient_accumulation_steps or it’s the last step in the epoch. After each optimization step, the learning rate is adjusted using the learning rate scheduler (lr_scheduler). Gradients are zeroed with optimizer.zero_grad() to prepare for the next batch.

    • In lines 26–31, based on checkpointing_steps, we check if the current step is a multiple of checkpointing_steps. If it is, the model checkpoint is saved. This is a common practice to save model progress during training.

    • Lines 32–34 ensure that the number of completed training steps does not exceed the maximum allowed training steps (args.max_train_steps). Otherwise, the training loop is terminated.
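Here is a minimal, self-contained sketch of the gradient accumulation idea in plain PyTorch, outside the Accelerate-based loop above. The toy model, data, and accumulation factor are made up purely for illustration.

import torch
from torch import nn

model = nn.Linear(10, 1)                      # toy model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

accumulation_steps = 4                        # effective batch size = 4 * micro-batch size
optimizer.zero_grad()

for step in range(8):                         # 8 micro-batches of size 1
    x, y = torch.randn(1, 10), torch.randn(1, 1)
    loss = loss_fn(model(x), y) / accumulation_steps  # scale so the accumulated gradient averages out
    loss.backward()                           # gradients accumulate in .grad across micro-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                      # one parameter update per accumulation window
        optimizer.zero_grad()

With the training steps in place, the same epoch loop switches the model to evaluation mode and runs over the dev set, as the next snippet shows.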

for epoch in range(args.num_train_epochs):
    # Model training steps are excluded here to keep the snippet easy to read.
    model.eval()
    if args.val_max_target_length is None:
        args.val_max_target_length = args.max_target_length
    gen_kwargs = {
        "max_length": args.val_max_target_length if args is not None else config.max_length,
        "num_beams": args.num_beams,
    }
    samples_seen = 0
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            generated_tokens = accelerator.unwrap_model(model).generate(
                batch["input_ids"],
                attention_mask=batch["attention_mask"],
                **gen_kwargs,
            )
            generated_tokens = accelerator.pad_across_processes(
                generated_tokens, dim=1, pad_index=tokenizer.pad_token_id
            )
            labels = batch["labels"]
            if not args.pad_to_max_length:
                # If we did not pad to max length, we need to pad the labels too
                labels = accelerator.pad_across_processes(batch["labels"], dim=1, pad_index=tokenizer.pad_token_id)
            generated_tokens, labels = accelerator.gather((generated_tokens, labels))
            generated_tokens = generated_tokens.cpu().numpy()
            labels = labels.cpu().numpy()
            if args.ignore_pad_token_for_loss:
                # Replace -100 in the labels as we can't decode them.
                labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
            if isinstance(generated_tokens, tuple):
                generated_tokens = generated_tokens[0]
            decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
            decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
            decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
            # If we are in a multiprocess environment, the last batch has duplicates
            if accelerator.num_processes > 1:
                if step == len(eval_dataloader) - 1:
                    decoded_preds = decoded_preds[: len(eval_dataloader.dataset) - samples_seen]
                    decoded_labels = decoded_labels[: len(eval_dataloader.dataset) - samples_seen]
                else:
                    samples_seen += len(decoded_labels)
            metric.add_batch(predictions=decoded_preds, references=decoded_labels)
    result = metric.compute(use_stemmer=True)
    # Extract a few results from ROUGE
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    result = {k: round(v, 4) for k, v in result.items()}
    logger.info(result)

Let’s review the code snippet above:

  • In line 3, we set the model to evaluation mode, which switches operations such as dropout and batch normalization to their inference behavior. This ensures that the model's evaluation is consistent and does not include the randomness these operations introduce during training.

  • In lines 4–5, we set the maximum target sequence length for generation during evaluation. If args.val_max_target_length is not specified, it is set to args.max_target_length.

  • In lines 7–8, we set a dictionary—gen_kwargs—that contains various generation settings. It specifies parameters for generating target sequences, such as the maximum length and the number of beams to use during generation. Next, we have samples_seen = 0, which we use to keep track of the number of samples processed during evaluation.

  • From lines 11–49, we have an inner loop for batch processing. This loop iterates over eval_dataloader to process a single batch of data at a time. Each batch contains input data and target sequences.

    • In line 12, we use with torch.no_grad() to ensure that the following operations are not tracked for gradient computation. We don’t need to compute gradients during evaluation because we’re not training the model.

    • In lines 13–38, we generate text sequences by calling the model's generate method with the input IDs and the other parameters specified in gen_kwargs. The generated tokens represent the model's predictions for the target sequences. Then, we post-process the generated tokens and the reference (target) labels: padding across processes, moving the tokens to the CPU as NumPy arrays, handling special tokens, and decoding the token sequences into human-readable text (a standalone generation sketch follows this list).

    • In line 49, the decoded predictions and reference labels are used to compute evaluation metrics. Remember that we set ROUGE as our metric.

  • In lines 51–56, we save the results of our metric. The use_stemmer=True argument indicates that stemming should be used when comparing the generated text to reference text. Stemming reduces words to their root forms, which can help match different inflections or forms of the same word. Please note that these lines are outside of the inner loop (eval_dataloader).
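To see the generation step in isolation, here is a minimal, hypothetical example of beam-search summarization with the same BART checkpoint used earlier. The input text and the generation settings are placeholders.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model.eval()

article = (
    "The city council approved a new public transit plan on Tuesday, "
    "promising more frequent bus service and two new light-rail lines by 2030."
)

inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=1024)
summary_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=60,   # plays the role of val_max_target_length
    num_beams=4,     # plays the role of args.num_beams
)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])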

Putting it all together#

A well-formatted version of all the code snippets is available in the GitHub repository: https://github.com/MehwishFatimah/t5_finetune/blob/main/run_summarization_no_trainer.py

This blog discussed fine-tuning pretrained abstractive summarization models using the Hugging Face (HF) library. We learned how to fine-tune a pretrained model on a given dataset, and we covered the training setup, the optimizer and learning rate scheduler configuration, the evaluation criteria, and setting up logs and checkpoints.

What’s next?#

So far, we have covered how the training step works for a summarization experiment. In the training, we used the training and dev sets to train and evaluate the model’s performance. In the upcoming blogs, we will explore how to evaluate a model’s performance with the test set and how the summarization metric works. See you then!

For a deeper understanding of NLP techniques and the use of Hugging Face Transformers in practical scenarios, explore our comprehensive courses tailored to enhance your skills in these cutting-edge technologies:

  • Applying Hugging Face Machine Learning Pipelines in Python

  • Building Advanced Deep Learning and NLP Projects