Challenge: Compare the Performance of Two Different LLMs

Evaluate text generation by using multiple LLMs, and determine the best performer.

Challenge

In this challenge, we’ll explore the capabilities of two LLMs, google/flan-t5-small and bigscience/mt0-small. The task is to use these models for a specific text generation task and evaluate their performance using ROUGE metrics.

Task

Translate the German proverb “Anfangen ist leicht, beharren eine Kunst” into English using both LLMs with the transformers pipeline. Then, evaluate each model’s performance using ROUGE metrics and determine which one performs better.
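To make the evaluation step concrete, here is a minimal, simplified sketch of the idea behind ROUGE-1 scoring (unigram overlap between a candidate and a reference translation). This is only an illustration; in the challenge itself you would use a full ROUGE implementation rather than this hand-rolled version, and the reference translation shown is just one plausible English rendering of the proverb.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    # Count unigrams that appear in both candidate and reference.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical reference translation of the proverb.
reference = "To begin is easy, to persist is an art"

print(rouge1_f1("Beginning is easy, persisting is an art", reference))  # 0.625
print(rouge1_f1("Starting is simple", reference))
```

A higher F1 means the candidate shares more unigrams with the reference, which is the same intuition the real ROUGE metrics formalize (ROUGE-2 and ROUGE-L extend this to bigrams and longest common subsequences).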

Using the transformers pipeline

We can use the transformers pipeline as follows.

Note: Google’s FLAN-T5-Small is an instruction-finetuned version of the T5 model that handles a diverse range of tasks without additional task-specific fine-tuning. Introduced in the “Scaling Instruction-Finetuned Language Models” research paper, this open-source, sequence-to-sequence LLM was fine-tuned on many tasks across multiple languages.

For google/flan-t5-small, use the following:
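A minimal sketch of the call is shown below. The exact prompt wording is an assumption (FLAN-T5 accepts natural-language instructions, so any clear “translate German to English” phrasing should work), and the first run will download the model weights from the Hugging Face Hub.

```python
from transformers import pipeline

# Load google/flan-t5-small as a text-to-text generation pipeline.
translator = pipeline("text2text-generation", model="google/flan-t5-small")

# FLAN-T5 is instruction-tuned, so we phrase the task as a prompt.
# The prompt wording here is an assumption, not a required format.
prompt = "Translate German to English: Anfangen ist leicht, beharren eine Kunst"

result = translator(prompt)
print(result[0]["generated_text"])
```

The pipeline returns a list of dictionaries, one per input, each with a `generated_text` key holding the model’s output string.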
