
Natural Language Processing with Python: A beginner's guide

10 min read
May 30, 2023


What is natural language processing (NLP)?#

Natural language processing (NLP) is a subfield of artificial intelligence (AI) that deals with the interaction between humans and computers using natural language. NLP is concerned with developing algorithms and computational models that enable computers to understand, analyze, and generate human language.

NLP is an intersection of different fields

NLP is a multidisciplinary field that draws on techniques from computer science, linguistics, mathematics, and psychology. Its goal is to build systems that can process and understand human language, which is a complex and nuanced form of communication that involves many layers of meaning.

NLP with Python#

As previously mentioned, NLP is a branch of AI that involves analyzing human-generated language data, including text and speech. Among industry professionals, Python is the preferred choice for manipulating text data due to its numerous advantages.

Python is easy to read and resembles pseudocode, making code simple to produce and test. Additionally, it has a high level of abstraction, which facilitates the development of NLP systems. Python's simplicity lets users focus on NLP rather than on programming language details, while its efficiency enables the quick creation of NLP application prototypes.

Python's popularity and robust community support make it a great choice for developing NLP systems. Furthermore, many open-source NLP libraries are available in Python as well as machine learning libraries like PyTorch, TensorFlow, and Apache Spark, which provide Python APIs.

Finally, Python's string and file operations are straightforward, making tasks such as splitting a sentence at whitespace a one-line command, as shown below. Overall, the combination of Python's strengths in string processing, the AI ecosystem, and machine learning libraries makes it the ideal language for NLP development.
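For instance, splitting a sentence at whitespace really is a one-liner with the built-in string API:

text = "NLP with Python is concise"
# str.split() with no arguments splits on any run of whitespace
tokens = text.split()
print(tokens)  # ['NLP', 'with', 'Python', 'is', 'concise']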

Applications of NLP#

  • Text classification involves categorizing text data into predefined categories or classes. This capability can be beneficial in various fields, such as spam detection, sentiment analysis, and topic modeling.

  • Named-entity recognition (NER) involves identifying and extracting entities such as people, organizations, and locations from text data. This can prove to be beneficial in various scenarios, including information extraction from sources like news articles or social media posts.

  • Part-of-speech (POS) tagging refers to the process of labeling each word in a sentence with its corresponding part of speech, such as a noun, verb, adjective, or adverb. This process is crucial in language analysis because it enables machines to understand the grammatical structure of sentences and identify the roles of different words in the sentence. The accurate tagging of parts of speech in a sentence can help improve the performance of various NLP applications, such as text classification, machine translation, and sentiment analysis. A minimal POS tagging and NER sketch follows this list.

  • Language modeling involves building probabilistic models of language that predict the likelihood of a given sequence of words. This can be useful in a variety of contexts, such as speech recognition and machine translation.

  • Machine translation—another important application of NLP—involves developing systems that can automatically translate text from one language to another. This can be useful in a variety of contexts, such as international business or cross-cultural communication.

  • Information retrieval encompasses the creation of systems and algorithms capable of extracting pertinent information from vast collections of textual data. This ability can be valuable in diverse settings, including search engines or recommendation systems.

  • Question answering involves building automated systems that can respond to questions asked in natural language. This capability can be advantageous in various scenarios, including customer service or educational environments.

  • Text summarization is another important application of NLP and involves generating a summary of a longer piece of text, such as an article or a document.

  • Text generation pertains to creating systems that can generate natural language text, such as language models that provide writing assistance or chatbots. This ability can prove to be valuable in numerous domains, such as customer support or creative writing.

  • Speech recognition, also known as automatic speech recognition (ASR) or speech-to-text conversion, involves the automatic conversion of spoken language into text. It is a critical component of many NLP systems and is used in a wide range of applications, including virtual assistants, speech-to-text dictation software, and interactive voice response systems.
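As a taste of the POS tagging and NER applications above, here is a minimal sketch using NLTK's off-the-shelf models. The resource names below are the standard NLTK downloads at the time of writing and may vary by version:

import nltk
# One-time downloads for the tokenizer, tagger, and NER chunker
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

sentence = "Guido van Rossum created Python at CWI in Amsterdam."
tokens = nltk.word_tokenize(sentence)
# Tag each token with a part-of-speech label, e.g. ('Python', 'NNP')
tagged = nltk.pos_tag(tokens)
print(tagged)
# Group tagged tokens into named-entity chunks such as PERSON and GPE
print(nltk.ne_chunk(tagged))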

This blog will focus on text analysis, mostly on data preparation.

There are several popular NLP libraries available in Python that offer a wide range of functionalities for text processing and analysis. Some of the most commonly used NLP libraries are:

Natural Language Toolkit (NLTK)

NLTK is a popular open-source library for NLP tasks. It provides a wide range of tools for tasks such as tokenization, part-of-speech tagging, parsing, sentiment analysis, and more.

spaCy

spaCy is another well-known, open-source library for NLP tasks. It’s known for its high performance and efficient processing of large text data. It provides tools for tasks such as tokenization, part-of-speech tagging, parsing, named-entity recognition, and more.
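For example, a short spaCy sketch (this assumes the small English model has been installed with python -m spacy download en_core_web_sm):

import spacy

# Load the small English pipeline (tokenizer, tagger, parser, NER)
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")
# Each token carries its part-of-speech tag
for token in doc:
    print(token.text, token.pos_)
# The doc carries its recognized named entities
for ent in doc.ents:
    print(ent.text, ent.label_)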

Logos of some well-known NLP libraries in Python

TextBlob

TextBlob provides a simple API for common NLP tasks such as sentiment analysis, part-of-speech tagging, and noun phrase extraction. It is built on top of NLTK and provides an easy-to-use interface for common NLP tasks.
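A quick TextBlob sketch (this assumes the textblob package is installed and its corpora have been fetched with python -m textblob.download_corpora):

from textblob import TextBlob

blob = TextBlob("Python makes text processing surprisingly pleasant.")
# sentiment returns (polarity, subjectivity); polarity > 0 is positive
print(blob.sentiment)
# noun_phrases extracts candidate noun phrases from the text
print(blob.noun_phrases)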

Gensim

Gensim is mainly used for topic modeling and natural language processing. It provides tools for building and training topic models such as Latent Dirichlet Allocation (LDA) and latent semantic analysis (LSA).
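A minimal LDA sketch with Gensim, using a toy corpus for illustration only:

from gensim import corpora, models

# A toy corpus: each document is a list of preprocessed tokens
texts = [["cat", "dog", "pet"],
         ["python", "code", "program"],
         ["dog", "pet", "animal"],
         ["code", "python", "software"]]
# Map each token to an integer id, then convert docs to bag-of-words vectors
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
# Train a two-topic LDA model on the bag-of-words corpus
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
print(lda.print_topics())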

scikit-learn

scikit-learn provides some NLP tools such as text preprocessing, feature extraction, and classification algorithms for text data.
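For instance, turning raw documents into TF-IDF features takes only a few lines (a sketch with a toy corpus):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog chased the cat"]
# Learn the vocabulary and compute TF-IDF weights in one step
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(X.toarray())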

Text analysis and its steps#

Text analysis, or text mining, is a process of extracting useful information and insights from textual data. It involves several steps that can be broadly classified into the following categories:

Note: The examples below work with English text. Plain Python versions that can run as is are shown first; equivalent versions using the NLTK library are also included, and these require the library and its data resources to be installed.

Data preparation #

This step involves cleaning and preprocessing the data to make it ready for analysis. The steps include:

  • Sentence segmentation: Sentence segmentation is the process of dividing a text into individual sentences.

def sentenceSegment(text):
    sentences = []
    start = 0
    for i in range(len(text)):
        # Treat '.', '!', and '?' as sentence boundaries
        if text[i] == '.' or text[i] == '!' or text[i] == '?':
            sentences.append(text[start:i+1].strip())
            start = i + 1
    return sentences

text = "Hello, NLP world!! In this example, we are going to do the basics of Text processing which will be used later."
print(sentenceSegment(text))

  Now here is sentence segmentation with the nltk library:

import nltk
nltk.download('punkt')
text = "Hello, NLP world!! In this example, we are going to do the basics of Text processing which will be used later."
sentences = nltk.sent_tokenize(text)
print(sentences)

The following is the output:

['Hello, NLP world!', '!', 'In this example, we are going to do the basics of Text processing which will be used later.']
  • Removing unwanted characters, punctuation, symbols, etc.

import string

def remove_punctuation(input_string):
    # Define a string of punctuation marks and symbols
    punctuations = string.punctuation
    # Keep only the characters that are not punctuation
    output_string = "".join(char for char in input_string if char not in punctuations)
    return output_string

text = "Hello, NLP world!! In this example, we are going to do the basics of Text processing which will be used later."
puncRemovedText = remove_punctuation(text)
print(puncRemovedText)
  • Converting the string into lowercase: This can help reduce the vocabulary size by treating words in uppercase and lowercase as the same word, which can help with some NLP tasks such as text classification and sentiment analysis.

def convertToLower(s):
    return s.lower()

text = "Hello, NLP world!! In this example, we are going to do the basics of Text processing which will be used later."
puncRemovedText = remove_punctuation(text)
lowerText = convertToLower(puncRemovedText)
print(lowerText)
  • Tokenizing: Tokenizing the text into words, phrases, or sentences. In the code below, the text is split into tokens at whitespace.

# In this code, we are not using any libraries:
# tokenize without using any function from string or any other helper,
# using only loops and if/else.
def tokenize(s):
    words = []  # token words are stored here
    i = 0
    word = ""
    while i < len(s):
        if s[i] != " ":
            word = word + s[i]
        else:
            words.append(word)
            word = ""
        i = i + 1
    words.append(word)
    return words

text = "Hello, NLP world!! In this example, we are going to do the basics of Text processing which will be used later."
puncRemovedText = remove_punctuation(text)
lowerText = convertToLower(puncRemovedText)
tokenizedText = tokenize(lowerText)
print(tokenizedText)

Next, here’s the tokenization with the nltk library:

import nltk

# Define input text
text = "Hello, NLP world!! In this example, we are going to do the basics of Text processing which will be used later."
# Remove punctuation and convert to lowercase (functions defined earlier)
puncRemovedText = remove_punctuation(text)
lowerText = convertToLower(puncRemovedText)
# Tokenize the text
tokens = nltk.word_tokenize(lowerText)
# Print the tokens
print(tokens)

The output will be as follows:

['hello', 'nlp', 'world', 'in', 'this', 'example', 'we', 'are', 'going', 'to', 'do', 'the', 'basics', 'of', 'text', 'processing', 'which', 'will', 'be', 'used', 'later']

Words like "we're" and "John's" can be tokenized using the nltk.word_tokenize function from the NLTK library. The word_tokenize function uses a tokenizer that is trained to recognize common patterns in natural language text, like contractions and possessives, and splits them into separate tokens.

import nltk
sentence = "We're going to John's house today."
tokens = nltk.word_tokenize(sentence)
print(tokens)

The output will look like this:

['We', "'re", 'going', 'to', 'John', "'s", 'house', 'today', '.']
  • Stop words—commonly used words such as “the” or “and” that are not useful for analysis—should be removed. In this example, assume that all words of length less than three will be removed. This is because shorter stop words are more likely to be articles, prepositions, or conjunctions, which are less likely to carry significant meaning on their own. Therefore, removing these stop words can help reduce noise in the text data without losing too much important information or context.

# Remove stop words.
# For this code, we simply assume all words of length less than three
# MUST be removed.
def stopWordRemoval(words):
    j = 0
    while j < len(words):
        if len(words[j]) < 3:
            words.remove(words[j])
        else:
            j = j + 1
    return words

text = "Hello, NLP world!! In this example, we are going to do the basics of Text processing which will be used later"
puncRemovedText = remove_punctuation(text)
lowerText = convertToLower(puncRemovedText)
tokenizedText = tokenize(lowerText)
cleaned = stopWordRemoval(tokenizedText)
print(cleaned)

Here’s code that uses the nltk library to remove stop words:

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

# Define input text
text = "Hello, NLP world!! In this example, we are going to do the basics of Text processing which will be used later."
puncRemovedText = remove_punctuation(text)
lowerText = convertToLower(puncRemovedText)
tokenizedText = tokenize(lowerText)
# Define stop words
stop_words = set(stopwords.words("english"))
# Remove stop words
filtered_tokens = [token for token in tokenizedText if token.lower() not in stop_words]
# Print the filtered tokens
print(filtered_tokens)

The output will be:

['hello', 'nlp', 'world', 'example', 'basics', 'text', 'processing', 'used', 'later']

As seen above, the stop words ("in", "this", "we", "are", "going", "to", "do", "the", "of", "which", "will", and "be") have been removed from the original list of tokens.

  • Stemming or lemmatizing the text to reduce words to their base form. The Porter stemmer is one of the best-known stemming algorithms and is available in the NLTK library; a lemmatization sketch follows the stemming example below.

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Define input text
text = "Hello, NLP world!! In this example, we are going to do the basics of Text processing which will be used later."
# Remove punctuation and convert to lowercase first (functions defined earlier),
# so punctuation marks do not show up as tokens
puncRemovedText = remove_punctuation(text)
lowerText = convertToLower(puncRemovedText)
# Tokenize the text
tokens = nltk.word_tokenize(lowerText)
# Define stop words
stop_words = set(stopwords.words("english"))
# Remove stop words
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
# Perform stemming
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
# Print the stemmed tokens
print(stemmed_tokens)

The output will be:

['hello', 'nlp', 'world', 'exampl', 'basic', 'text', 'process', 'use', 'later']
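Lemmatization, in contrast, reduces words to their dictionary form (lemma) rather than a chopped stem. A minimal sketch with NLTK's WordNet lemmatizer (this assumes the wordnet resource has been downloaded):

import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
# Unlike stemming, lemmatization returns real words;
# pos="v" tells the lemmatizer to treat the input as a verb
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'
print(lemmatizer.lemmatize("cars"))              # 'car'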

Text exploration #

This step involves exploring and visualizing the text data to gain insights and identify patterns. The steps involved in text exploration include:

  • Counting the frequency of words or phrases in the text using techniques such as term frequency-inverse document frequency (TF-IDF). TF-IDF can then be used to create word clouds or visualizations that identify and display the most common or important terms.

  • Identifying co-occurrence of words or phrases using techniques such as collocation analysis or co-occurrence matrices.

    • Bigrams are a type of collocation that represent two adjacent words that frequently co-occur in a text corpus. For example, "machine learning" is a bigram that often appears together in NLP-related text.

    • Mutual information is a statistical measure that can be used to identify significant co-occurrences of words or phrases in a text corpus. It measures the degree to which the occurrence of one word or phrase is related to the occurrence of another word or phrase in the same context. Mutual information can also be used to identify meaningful collocations or associations between words. A small collocation sketch follows this list.

  • Using clustering techniques to group similar text data together.

  • Using language modeling techniques to generate new text that is similar in style and content to a given text corpus. Note that language modeling involves predicting the probability of a sequence of words in a given context.
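As an illustration of the collocation ideas above, here is a small sketch using NLTK's collocation finder on a toy token list; PMI is the pointwise form of the mutual information measure mentioned earlier:

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = ["machine", "learning", "is", "fun", "and",
          "machine", "learning", "is", "useful"]
# Count adjacent word pairs across the token stream
finder = BigramCollocationFinder.from_words(tokens)
bigram_measures = BigramAssocMeasures()
# Rank bigrams by pointwise mutual information (PMI)
print(finder.nbest(bigram_measures.pmi, 3))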

Text analysis #

This stage entails examining the text data to extract valuable information and arrive at conclusions. The steps involved in text analysis include:

  • Sentiment analysis to determine the overall sentiment of the text (positive, negative, or neutral). A minimal sentiment sketch follows this list.

  • Named-entity recognition to identify and extract entities such as people, organizations, or locations.

  • Text classification to classify the text into predefined categories or labels.

  • Text summarization to generate a summary of the text data.

  • Topic modeling to identify the underlying themes or topics within a corpus of text without prior knowledge of what those topics might be.
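For example, a minimal sentiment analysis sketch with NLTK's built-in VADER analyzer (this assumes the vader_lexicon resource has been downloaded):

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()
# polarity_scores returns negative, neutral, positive, and compound scores
print(sia.polarity_scores("I love NLP, it is amazing!"))
print(sia.polarity_scores("This was a terrible experience."))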

Interpretation and visualization #

This step involves interpreting the results of the text analysis and presenting them in a way that is easy to understand.

Visualizing and interpreting the results is an important step in text analysis

The steps involved in interpretation and visualization include:

  • Creating charts or graphs to visualize the results of the text analysis (see the sketch after this list).

  • Using dashboards or reports to present the results in a structured and organized way.

  • Providing explanations and insights based on the results of the text analysis.
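As a simple illustration of the charting step, here is a sketch that plots the most frequent tokens with matplotlib; any token list produced by the preparation steps above would work:

from collections import Counter
import matplotlib.pyplot as plt

tokens = ["nlp", "text", "nlp", "python", "text", "nlp"]
# Count token frequencies and take the top results
counts = Counter(tokens).most_common(3)
words, freqs = zip(*counts)
plt.bar(words, freqs)
plt.title("Most frequent tokens")
plt.xlabel("Token")
plt.ylabel("Frequency")
plt.show()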

Wrapping up and next steps#

This blog offers an overview of NLP in Python, covering several fundamental concepts and techniques, including text preprocessing, tokenization, stemming and lemmatization, part-of-speech tagging, named-entity recognition, and sentiment analysis. While this blog provides a good starting point for learning NLP in Python, there is much more to explore in this field. Some of the next steps you can take include:

  • Exploring the different NLP techniques and algorithms further, such as topic modeling, word embeddings, and deep learning-based methods.
  • Practicing on real-world datasets to gain more experience in NLP, which can be done by participating in any of the courses below:
  • Building Advanced Deep Learning and NLP Projects

    In this course, you'll not only learn advanced deep learning concepts, but you'll also practice building some advanced deep learning and Natural Language Processing (NLP) projects. By the end, you will be able to utilize deep learning algorithms that are widely used in industry. This is a project-based course with 12 projects in total. This will get you used to building real-world applications that are being used in a wide range of industries. You will be exposed to the most common tools used for machine learning projects including: NumPy, Matplotlib, scikit-learn, TensorFlow, and more. It's recommended that you have a firm grasp of these topic areas: Python basics, NumPy and Pandas, and artificial neural networks. Once you're finished, you will have the experience to start building your own amazing projects, and some great new additions to your portfolio.

    5hrs
    Intermediate
    53 Playgrounds
    10 Quizzes

  • Natural Language Processing with Machine Learning

    In this course, you'll learn techniques for processing text data, creating word embeddings, and using long short-term memory networks (LSTM) for tasks such as semantic analysis and machine translation. After completing this course, you will be able to solve the important day-to-day NLP problems faced in industry, which is incredibly useful given the prevalence of text data. The code for this course is built around the TensorFlow framework, one of the premier frameworks for industry machine learning, and the Python pandas library for data analysis. Knowledge of Python and TensorFlow is a prerequisite. This course was created by AdaptiLab, a company specializing in evaluating, sourcing, and upskilling enterprise machine learning talent. It is built in collaboration with industry machine learning experts from Google, Microsoft, Amazon, and Apple.

    9hrs
    Advanced
    33 Challenges
    4 Quizzes

  • Performing NLP Tasks Using the Cloudmersive API in Python

    Cloudmersive’s Natural Language Processing (NLP) API is a highly flexible, helpful tool to add to the software engineer’s toolkit as it provides documentation of several APIs. In this course, you’ll be introduced to Cloudmersive’s NLP API. You’ll learn to perform basic linguistic operations using API calls, including semantic analysis, language detection, and translation between languages. You’ll also learn how to request a segmentation and rephrase a sentence through the API. Towards the end of the course, you’ll learn how to demonstrate all the operations of Natural Language Processing using the Cloudmersive NLP API in a Django application with the help of a demo application.

    1hr 30mins
    Beginner
    8 Playgrounds
    19 Illustrations

    Some other ways to get hands-on experience with natural language processing are:

    • By experimenting with different preprocessing and feature extraction techniques to improve the performance of NLP models.

    • By learning more about the domain-specific challenges in NLP, such as dealing with noisy data, handling multilingual text, and developing models for low-resource languages.

    • By staying up-to-date with the latest research and developments in NLP, reading academic papers, and following relevant conferences and journals.

    Remember, NLP is a vast and quickly evolving field, so the key to mastering it is to keep learning and experimenting with new ideas and techniques.


    Written By:
    Kamran Lodhi
     