What is topic modeling in NLP?

Overview

Natural Language Processing (NLP) is a field of machine learning concerned with the interactions between computers and human language.

Topic modeling is a subfield of NLP used to determine the topics of a set of documents based on their content.

How does topic modeling work?


Topic modeling can be described as assigning a general topic to a set of documents that best describes and fits the contents of those documents. This helps us deal with immense amounts of data that surround us and would otherwise take a lot more time to organize and process.

For example, consider a set of 200 documents, each consisting of approximately 1,000 words. This means there are a total of 200 x 1,000 = 200,000 words to process. If we apply topic modeling to this set and cluster the documents into 6 topics, we only need to examine roughly one representative document's worth of words per topic, that is, about 6 x 1,000 = 6,000 words.

In summary, what topic modeling essentially does is:

  • Identifies latent topical patterns that appear across the collection.
  • Annotates documents with topics based on their content.

These annotations can then be used for better organization and summarization, ultimately leading to reduced processing time.

Applications of topic modeling

In the modern-day world, topic modeling is used in various areas to draw meaningful information from vast amounts of textual data. Some of the industrial applications of topic modeling are listed below:

Sentiment Analysis


Sentiment analysis can be described as the task of computationally recognizing and categorizing opinions expressed in a piece of text, especially to discern whether the writer's attitude toward a given topic, product, and so on is positive, negative, or neutral. Topic modeling can support this task: for example, a company may want to know the overall polarity of the product reviews its customers have left on its website.

The spam or ham problem


We often find our inboxes flooded with emails, so much so that a message we are looking forward to can get lost among a large number of spam emails. Machine learning applications such as the one we are discussing are used to implement spam filters that keep important emails at the top of your inbox and move spam to the junk folder, preventing your inbox from being flooded.

Chatbots


Over time, we have seen chatbots increasingly deployed by websites, either to collect data from customers or to buy time while a 'human' customer support agent reaches out to address the problem. Accurately predicting conversation topics is a useful ingredient for building cohesive and engaging dialogue systems. With the help of topic modeling algorithms, chatbots can extract useful information from thousands of customer queries in a matter of seconds. They can group multiple queries under a single topic, so without spending too much time the management can see what the majority of the queries are about and subsequently improve their services in that area.

Topic modeling techniques

The techniques listed below are some of the most common and popular techniques that are used to perform topic modeling in NLP:

  1. Latent Semantic Analysis (LSA)
  2. Probabilistic Latent Semantic Analysis (pLSA)
  3. Latent Dirichlet Allocation (LDA)

Latent semantic analysis - LSA

LSA analyzes a set of documents and the terms contained within them. It scans unstructured data for hidden correlations between phrases and concepts using singular value decomposition (SVD), a mathematical approach that factorizes a matrix into three matrices.

Probabilistic latent semantic analysis - pLSA

The purpose of pLSA is to use a probabilistic framework to describe the co-occurrence of information in order to find the data’s underlying semantic structure.
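Concretely, pLSA models the probability of a document $d$ and word $w$ co-occurring as a mixture over latent topics $z$:

```latex
P(d, w) = \sum_{z} P(z)\, P(d \mid z)\, P(w \mid z)
```

The topic variable $z$ is never observed directly; its distributions are typically fit with the EM algorithm so that the mixture best explains the observed co-occurrence counts.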

Latent Dirichlet Allocation - LDA

Latent Dirichlet Allocation, or LDA for short, is another well-known technique used to perform topic modeling on a given set of documents.

The term latent refers to something that exists but has not yet been discovered; in this case, the topics of the documents.

Dirichlet allocation comes from the fact that the model assumes the distribution of topics within a document, and of words within a topic, follows a Dirichlet distribution.


It is interesting to note that all three techniques have the word latent in their names, which indicates their primary task: to find something that exists but has not yet been discovered. This is essentially what topic modeling is based on.
