Text and Token Classification
Learn to perform text and token classification tasks using Hugging Face models.
Let's begin our NLP tasks with text classification.
Text classification
Text classification infers the type or category of a given text. For example, we can determine whether a book is a success based on whether its reviews are positive or negative, detect a passage's tone (as writing assistants commonly do), or verify whether a sentence or passage is grammatically correct. We can create a text classifier by calling the pipeline() function with the "text-classification" task:
from transformers import pipeline

textClassifier = pipeline("text-classification")
Some other uses of text classification include:
- Sentiment analysis
- Natural Language Inference (NLI)
- Grammatical verification
Sentiment analysis
Have you ever wondered how companies like Amazon know whether a certain product is a success or a flop based on customer reviews? Thanks to NLP, we can run sentiment analysis on those reviews. In sentiment analysis, we take a sentence and infer whether it's positive, negative, or neutral.
As an example, we apply it to one of the most iconic opening lines from Herman Melville's classic, Moby Dick (1851).
textClassifier("Call me Ishmael. \Some years ago—never mind how long precisely—having little \or no money in my purse, and nothing particular to interest me on shore, \I thought I would sail about a little and see the watery part of the world. \It is a way I have of driving off the spleen, and regulating the circulation.")
Unsurprisingly, it returns "NEGATIVE" (with a high confidence score), largely due to words like "no" and "nothing."
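For reference, the text-classification pipeline returns a list of dictionaries with label and score keys. The snippet below only illustrates that shape (the input sentence is our own example, and the exact score depends on the underlying model):

result = textClassifier("What a masterpiece!")
print(result)  # e.g., [{'label': 'POSITIVE', 'score': 0.99...}]; exact score varies by model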
Natural Language Inference (NLI)
Natural Language Inference is another application of text classification where we provide a hypothesis along with some context (the premise). The model determines whether the hypothesis is one of the following:
- True: Commonly referred to in both NLP literature and Hugging Face Models as entailment.
- False: Often called a contradiction. Hugging Face uses the same nomenclature.
- Undetermined: Often, the passage doesn't give enough information to support or refute the hypothesis. The model labels these cases as neutral.
Question-answering NLI (QNLI)
These neutral cases led to the development of Question-answering NLI (QNLI), where the model estimates the probability that a given passage contains the answer to a given question.
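As a minimal sketch, a QNLI check can run through the same text-classification pipeline by passing a question and a sentence as a pair. The checkpoint name below is an assumption; substitute any QNLI-finetuned model from the Hub:

# Assumed QNLI-finetuned checkpoint; swap in any equivalent model.
qnliClassifier = pipeline("text-classification", model="cross-encoder/qnli-electra-base")
# Text pairs are passed as a dict with "text" and "text_pair" keys.
qnliClassifier({"text": "Which city did Ptolemy call Labokla?",
                "text_pair": "Ptolemy mentions in his Geographia a city called Labokla."})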
Example
Here, we'll use a RoBERTa model fine-tuned on the Multi-Genre Natural Language Inference (MultiNLI) corpus. Applying it to the following premise and hypothesis pair returns entailment.
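The snippet below shows one way to set up this classifier, assuming the roberta-large-mnli checkpoint (its labels are CONTRADICTION, NEUTRAL, and ENTAILMENT):

# Assuming the roberta-large-mnli checkpoint; any MNLI-finetuned model works here.
nliClassifier = pipeline("text-classification", model="roberta-large-mnli")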
nliClassifier("Staying clean is a good thing. Hygiene is a lovely thing.")
Grammatical verification
We've all noticed a significant improvement in the capabilities of email editors: they're quick to notify us if our message contains grammatical errors. This improvement isn't limited to email editors and can be seen in other applications, such as writing assistants. These gains can be attributed to progress in grammatical verification, the type of text classification in which we verify whether a given sentence is grammatically correct. For complex sentences, a simple yes or no isn't sufficient, so these models also return a score as a confidence or correctness measure for the input.
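As a minimal sketch, assuming the textattack/bert-base-uncased-CoLA checkpoint (a model fine-tuned on the Corpus of Linguistic Acceptability), we can score a sentence's acceptability:

# Assumed CoLA-finetuned checkpoint; its labels are typically
# LABEL_1 (acceptable) and LABEL_0 (unacceptable).
grammarChecker = pipeline("text-classification", model="textattack/bert-base-uncased-CoLA")
grammarChecker("Him and me goes to the store.")  # expected: low acceptability
grammarChecker("He and I go to the store.")      # expected: high acceptability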
Token classification
Natural language can be difficult for models to process directly, so we perform some preprocessing before feeding text into an NLP model. Tokenization allows us to demarcate the parts of a sentence.
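To see what these tokens actually look like, we can load a tokenizer directly; here we assume the bert-base-uncased checkpoint for illustration:

from transformers import AutoTokenizer

# Assuming the bert-base-uncased tokenizer for illustration.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Uncommon words are split into subword pieces; continuations carry a '##' prefix.
print(tokenizer.tokenize("Ptolemy mentions ancient Lahore."))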
Hugging Face also allows us to perform classification on these tokens. There are a couple of popular sub-tasks:
- Named Entity Recognition (NER)
- Part-of-Speech (PoS) tagging
Named Entity Recognition (NER)
In Named Entity Recognition (NER), also known as entity identification, the classifier extracts and labels key pieces of information (entities), such as people, places, and organizations.
After calling the ner pipeline, we apply it to the following sentence:
tokenClassifier = pipeline("ner")
tokenClassifier("Ptolemy mentions in his Geographia a city called Labokla which may have been in reference to ancient Lahore.")
In the following output, we only show a couple of entities:
[{'entity': 'I-PER', 'score': 0.9345261, 'index': 1, 'word': 'Ptolemy', 'start': 0, 'end': 7},
 {'entity': 'I-MISC', 'score': 0.9682058, 'index': 5, 'word': 'G', 'start': 24, 'end': 25},
 {'entity': 'I-MISC', 'score': 0.8515828, 'index': 6, 'word': '##eo', 'start': 25, 'end': 27}]
As we can see, it returns each entity along with its starting and ending indices, which makes the output easy to parse if needed. Note that words like "Geographia" are split into subword tokens (the ## prefix marks a continuation), so a single word can yield several entries.
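If whole entities are preferable to subword pieces, the token-classification pipeline accepts an aggregation_strategy parameter that merges adjacent tokens into single entities (a sketch; the exact output shape can vary across transformers versions):

# Merge subword tokens back into whole entities.
tokenClassifier = pipeline("ner", aggregation_strategy="simple")
tokenClassifier("Ptolemy mentions in his Geographia a city called Labokla which may have been in reference to ancient Lahore.")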
Part-of-Speech (PoS) tagging
Downstream NLP models can benefit from knowing each token's part of speech, and PoS tagging handles exactly this task.
We'll try it using a Bidirectional Encoder Representations from Transformers (BERT) uncased model:
posTagger = pipeline("token-classification", model="vblagoje/bert-english-uncased-finetuned-pos")
We can apply it to the sentence "This chapter will provide an overview of performing common NLP tasks." to get the following output:
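The call itself looks like this:

posTagger("This chapter will provide an overview of performing common NLP tasks.")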
Note: The output is trimmed for concision.
{'entity': 'DET', 'score': 0.9994802, 'index': 1, 'word': 'this', 'start': 0, 'end': 4},
{'entity': 'NOUN', 'score': 0.9989172, 'index': 2, 'word': 'chapter', 'start': 5, 'end': 12},
{'entity': 'AUX', 'score': 0.9992337, 'index': 3, 'word': 'will', 'start': 13, 'end': 17}