How to do tokenization in NLP

Natural language processing (NLP) focuses on how computers and human language interact. It involves developing algorithms that enable computers to understand human language and extract useful information from it.

What is tokenization?

Tokenization is the process of breaking a sentence or paragraph into smaller chunks called tokens. These tokens may be words, characters, or parts of words (subwords). By tokenizing text, NLP algorithms can operate on smaller and more meaningful units, which enables more accurate analysis, modeling, and understanding of textual data.
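To make the distinction between token types concrete, here is a minimal sketch in plain Python. Word and character tokens can be produced with built-ins; subword tokens require a learned vocabulary (for example, from a BPE tokenizer), so they are only illustrated in a comment:

text = "Tokenization matters"

# Word-level tokens: split on whitespace
word_tokens = text.split()   # ['Tokenization', 'matters']

# Character-level tokens: every character becomes a token
char_tokens = list(text)     # ['T', 'o', 'k', 'e', ...]

# Subword tokens come from a trained vocabulary; a BPE tokenizer
# might, for instance, produce something like ['Token', 'ization', 'matters'].

print(word_tokens)
print(char_tokens)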

Now that we know what tokenization is, let's look at some tokenization techniques.

Tokenization techniques

We can tokenize text in various ways: common approaches include word, character, subword, and sentence tokenization. The best method depends on the language, the library, and the modeling goal.


Tokenization implementation

Tokenization is an essential step in natural language processing (NLP). Let's take a look at the steps needed to implement it.

Choose a programming language or library

Some popular choices for NLP tokenization include Python with libraries like NLTK, spaCy, and scikit-learn, or Java with Apache OpenNLP.

Apply tokenization techniques

Once the text data is loaded and prepared, it's time to apply the chosen tokenization technique. The specific steps vary depending on the library or tool, but the general process is to call the tokenization function or method the library provides and pass the text data as input.

Tokenization using Python's built-in methods

# Sample text
text = "Tokenization is important for NLP. It helps in breaking down text into individual units."

# Word tokenization: split on whitespace
# Note: punctuation stays attached to words (e.g., 'NLP.')
word_tokens = text.split()
print("Word tokens:")
print(word_tokens)

# Sentence tokenization: split on a period followed by a space
# Note: split() removes the delimiter, so the first sentence loses its period
sentence_tokens = text.split(". ")
print("\nSentence tokens:")
print(sentence_tokens)

Tokenization using NLTK

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize
# Sample text
text = "Tokenization is important for NLP. It helps in breaking down text into individual units."
# Word tokenization using NLTK
# Tokenize the text into individual words
word_tokens = word_tokenize(text)
print("Word tokens:")
print(word_tokens)
# Sentence tokenization using NLTK
# Tokenize the text into individual sentences
sentence_tokens = sent_tokenize(text)
print("\nSentence tokens:")
print(sentence_tokens)

Code explanation

  • Lines 1–3: We import the NLTK library, download the punkt tokenizer models, and import the word_tokenize() and sent_tokenize() functions.

  • Lines 4–10: We define the sample text and use the word_tokenize() function from NLTK to tokenize it into individual words.

  • Lines 11–15: We use the sent_tokenize() function from NLTK to tokenize the text into individual sentences.

Tokenization using regular expressions (RegEx)

import re
# Sample text
text = "Tokenization is important for NLP. It helps in breaking down text into individual units."
# Word tokenization using regular expressions
# Pattern explanation: \b\w+\b matches one or more word characters surrounded by word boundaries
word_tokens = re.findall(r'\b\w+\b', text)
print("Word tokens:")
print(word_tokens)
# Sentence tokenization using regular expressions
# Pattern explanation: (?<=\w\.)\s matches a whitespace character preceded by a word character followed by a period
sentence_tokens = re.split(r'(?<=\w\.)\s', text)
print("\nSentence tokens:")
print(sentence_tokens)

Code explanation

  • Line 1: We import re, Python's regular expressions module.

  • Lines 4–8: The re.findall() function searches the text using the regular expression pattern \b\w+\b, which matches one or more word characters surrounded by word boundaries. The matches are stored in the word_tokens variable.

  • Lines 9–13: The re.split() function splits the text using the regular expression pattern (?<=\w\.)\s, which matches a whitespace character preceded by a word character and a period. The resulting sentence tokens are stored in the sentence_tokens variable.

As explained in this Answer, there are various ways to tokenize text in NLP. Here's a summary of the word tokenization methods in popular Python libraries:

Tokenization methods in Python libraries

  • NLTK: nltk.word_tokenize

  • spaCy: nlp.tokenizer

  • Gensim: gensim.utils.tokenize

  • Keras: keras.preprocessing.text.Tokenizer

  • scikit-learn: CountVectorizer().build_tokenizer()

  • TextBlob: TextBlob(text).words
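To illustrate one of the entries above, here is a minimal sketch of word and sentence tokenization with spaCy. It assumes spaCy is installed and uses a blank English pipeline (so no pretrained model download is needed); the rule-based sentencizer component is added to enable sentence splitting:

import spacy

# Blank English pipeline: includes only the rule-based tokenizer
nlp = spacy.blank("en")
# Add the rule-based sentence segmenter so doc.sents is available
nlp.add_pipe("sentencizer")

# Sample text
text = "Tokenization is important for NLP. It helps in breaking down text into individual units."
doc = nlp(text)

# Word tokenization: iterate over the tokens in the Doc
word_tokens = [token.text for token in doc]
print("Word tokens:")
print(word_tokens)

# Sentence tokenization: iterate over the sentence spans
sentence_tokens = [sent.text for sent in doc.sents]
print("\nSentence tokens:")
print(sentence_tokens)

The other libraries in the table follow the same pattern: load or construct a tokenizer, then pass the text to it and collect the resulting tokens.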
