
Natural Language Processing with PyCaret

Explore natural language processing with PyCaret by applying topic modeling to a collection of documents. Understand key NLP concepts and use probabilistic models like LDA to discover hidden topics. Gain hands-on experience loading and analyzing the BBC News dataset, then use topic probabilities for multiclass classification.

Natural language processing (NLP) sits at the intersection of computational linguistics and machine learning. The main goal of this dynamic field is to extract information and insights from natural languages, meaning those that humans speak in their everyday lives. NLP comprises a wide variety of methods and techniques, including topic modeling, sentiment analysis, machine translation, document summarization, and speech-to-text conversion.

In this chapter, we’ll focus on topic modeling because it’s supported by the NLP module of the PyCaret library. We can use this technique to discover topics: the hidden structures that let us semantically group the documents in a collection, known as the corpus.

Latent Dirichlet allocation (LDA) is a generative probabilistic model that can be used for topic modeling. Under LDA, the marginal probability of a single document is given by the following equation:

$$
p(\mathsf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_{n}} p(z_{n} \mid \theta)\, p(w_{n} \mid z_{n}, \beta) \right) d\theta
$$

...
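To make the integral more concrete, here is a minimal Python sketch that approximates it by Monte Carlo sampling. It reads the symbols in the standard LDA way: `alpha` is the Dirichlet parameter over topics, `theta` is a document's topic mixture, each `z_n` is the topic assigned to word `w_n`, and `beta[z][w]` is the probability of word `w` under topic `z`. The function names, the two-topic setup, and all numeric values are made up for illustration. They do not come from PyCaret or from the BBC News dataset, and PyCaret does not estimate LDA this way.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def likelihood_given_theta(doc, theta, beta):
    """Inner term of the equation: prod_n sum_z p(z | theta) * p(w_n | z, beta)."""
    likelihood = 1.0
    for word_id in doc:
        likelihood *= sum(theta[z] * beta[z][word_id] for z in range(len(theta)))
    return likelihood


def estimate_marginal_likelihood(doc, alpha, beta, n_samples=20000, seed=0):
    """Approximate the integral over theta by averaging over Dirichlet samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        theta = sample_dirichlet(alpha, rng)
        total += likelihood_given_theta(doc, theta, beta)
    return total / n_samples


# Hypothetical toy setup: 2 topics over a 3-word vocabulary (word ids 0, 1, 2).
alpha = [0.5, 0.5]
beta = [
    [0.8, 0.1, 0.1],  # topic 0 favors word 0
    [0.1, 0.1, 0.8],  # topic 1 favors word 2
]
doc = [0, 0, 2]  # a three-word document given as word ids

print(estimate_marginal_likelihood(doc, alpha, beta))
```

The sum over `z` inside `likelihood_given_theta` marginalizes the per-word topic assignment, and the average over sampled `theta` values stands in for the integral. Practical LDA implementations rely on approximate inference such as variational methods or Gibbs sampling rather than this brute-force estimate, which is only meant to show how the terms of the equation fit together.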