Training the spaCy Text Classifier
Let's learn about the details of spaCy's text classifier component.
In this section, we will learn about the details of spaCy's text classifier component, TextCategorizer. Previously, we saw that the spaCy NLP pipeline consists of components. We also learned about the essential components of the spaCy NLP pipeline, which are the sentence tokenizer, POS tagger, dependency parser, and named entity recognition (NER).
TextCategorizer is an optional, trainable pipeline component. To train it, we need to provide examples and their class labels. We first add TextCategorizer to the NLP pipeline and then run the training procedure. The illustration below shows where the TextCategorizer component sits in the NLP pipeline: it comes after the essential components. In the diagram, textcat refers to the TextCategorizer component.
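The first of these steps, adding the component and pairing examples with their class labels, can be sketched as follows. This is a minimal sketch assuming spaCy v3; the blank English pipeline, the POSITIVE and NEGATIVE labels, and the sample sentence are illustrative assumptions, not part of the lesson.

```python
import spacy
from spacy.training import Example

# Assumption: a blank English pipeline (tokenizer only), so no pretrained
# model needs to be downloaded for this sketch.
nlp = spacy.blank("en")

# Add the text classifier; by default a new component goes last in the pipeline.
textcat = nlp.add_pipe("textcat")

# Register the class labels to predict (hypothetical labels for illustration).
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

# One training example: the raw text plus its class labels under the "cats" key.
text = "This movie was wonderful."
annotations = {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}
example = Example.from_dict(nlp.make_doc(text), annotations)

print(nlp.pipe_names)
print(textcat.labels)
```

No training happens here; the sketch only shows the component's place in the pipeline and the shape of a labeled example.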
A neural network architecture lies behind spaCy's TextCategorizer. TextCategorizer gives us a user-friendly, end-to-end way to train the classifier, so we don't have to deal directly with the neural network architecture. We'll design our own neural network architecture in the upcoming chapters. After looking at the architecture, we’re ready to dive into TextCategorizer code. Let’s get to know the TextCategorizer class first.
Getting to know the TextCategorizer class
Now let's look at the TextCategorizer class in detail. First of all, we import the default single-label model configuration for TextCategorizer from the pipeline components:
from spacy.pipeline.textcat import DEFAULT_SINGLE_TEXTCAT_MODEL
TextCategorizer is available in two flavors: a single-label classifier and a multilabel classifier. As we remarked previously, a multilabel classifier can predict more than one class. A single-label classifier predicts only one class for each example, and the classes are mutually exclusive. The preceding import line imports the ...