Initializing the NLP Environment

Let’s familiarize ourselves with the NLP environment initialization.

Now we’ll initialize the PyCaret NLP environment and create the transformation pipeline with the setup() function. The target parameter specifies the dataset’s text column, which goes through the preprocessing steps described below. The session_id parameter fixes the random seed so that results are reproducible. Once setup completes, we display the first 10 rows of the preprocessed dataset.

# Initializing the NLP environment
nlp_ = nlp.setup(data=data, target='text', session_id=6842)

# Retrieve the preprocessed dataset and display its first 10 rows
data_ = nlp.get_config('data_')
data_.head(10)


Numeric and special character removal

Numbers and punctuation usually carry little information for text analysis tasks such as topic modeling, so PyCaret removes all numeric and special characters from the corpus. It uses regular expressions to replace these characters with spaces.
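To make the idea concrete, here is a minimal sketch of regex-based character removal using Python's standard re module. It is not PyCaret's internal implementation: the function name remove_numeric_and_special, the exact pattern, and the whitespace-collapsing step are assumptions chosen for illustration.

```python
import re


def remove_numeric_and_special(text: str) -> str:
    """Replace digits and special characters with spaces (illustrative only)."""
    # Anything that is not an ASCII letter or whitespace becomes a space.
    cleaned = re.sub(r"[^A-Za-z\s]", " ", text)
    # Assumption: collapse the resulting runs of whitespace for readability.
    return re.sub(r"\s+", " ", cleaned).strip()


sample = "Loan #42 was repaid in 2019, earning 5% interest!"
print(remove_numeric_and_special(sample))
# Loan was repaid in earning interest
```

Replacing the characters with spaces rather than deleting them keeps neighboring words from being fused together, for example "2019,earning" still yields a separate "earning" token.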

Word tokenization

Tokenization is the process of splitting the corpus into tokens, smaller units that are usually words. This is fundamental and typically one of the first steps in NLP because it ...