Initializing the NLP Environment
Let’s familiarize ourselves with the NLP environment initialization.
Now we’ll initialize the PyCaret NLP environment and create the transformation pipeline by using the setup() function. The target parameter lets us specify the dataset’s text column, which will go through the preprocessing steps described below. After this process is completed, the first 10 instances of the preprocessed dataset are printed.
# Initializing the NLP environment
nlp_ = nlp.setup(data=data, target='text', session_id=6842)
data_ = nlp.get_config('data_')
data_.head(10)
Numeric and special character removal
Numbers and punctuation usually carry little information for this kind of text analysis, so PyCaret removes all numeric and special characters from the corpus. It uses regular expressions to replace these characters with spaces.
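To make this step concrete, here is a minimal sketch of the idea using Python's built-in re module. It is not PyCaret's internal code: the helper name remove_numeric_and_special, the pattern, and the sample sentence are all assumptions chosen for illustration. The pattern treats anything that is not an ASCII letter or whitespace as a character to replace.

```python
import re

# Hypothetical pattern, not taken from PyCaret: match every character
# that is neither an ASCII letter nor whitespace.
NON_LETTERS = re.compile(r"[^A-Za-z\s]")


def remove_numeric_and_special(text: str) -> str:
    """Replace each digit and special character with a single space."""
    return NON_LETTERS.sub(" ", text)


# Made-up sample sentence for illustration only.
sample = "Loan #42 was repaid in 2019!"
cleaned = remove_numeric_and_special(sample)

# Only the alphabetic words survive once the text is split on whitespace.
print(cleaned.split())
```

Because each removed character becomes a space rather than being deleted, neighboring words are never glued together, and the extra spaces disappear when the text is later split into tokens.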
Word tokenization
Tokenization is the process of splitting the corpus into tokens, smaller units that are usually words. This is fundamental and typically one of the first steps in NLP because it ...