Natural language processing (NLP) focuses on how computers and human language interact. It involves developing algorithms that enable computers to understand human language and extract useful information from it.
Tokenization is the process of breaking a sentence or paragraph into smaller chunks called tokens. These tokens may be words, characters, or parts of words (subwords). By tokenizing text, NLP algorithms can operate on smaller, more meaningful units, which enables more accurate analysis, modeling, and understanding of textual data.
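To make the idea concrete, here's a minimal sketch in plain Python (no NLP library) that splits the same sentence into word-level and character-level tokens. The sample sentence is just an illustrative choice:

```python
# Minimal illustration of word-level vs. character-level tokens.
# Real tokenizers handle punctuation, subwords, and edge cases far more carefully.
sentence = "NLP is fun!"

word_tokens = sentence.split()   # split on whitespace
char_tokens = list(sentence)     # every character becomes a token

print(word_tokens)   # ['NLP', 'is', 'fun!']
print(char_tokens)   # ['N', 'L', 'P', ' ', 'i', 's', ' ', 'f', 'u', 'n', '!']
```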
Now that we know what tokenization is, let's look at some tokenization techniques.
We can tokenize a given text input in various ways, and the choice of method depends on the language, the library, and the modeling goal.
Tokenization is an essential step in NLP. Let's take a look at the steps needed to implement it.
Some popular choices for tokenization include Python libraries such as NLTK, spaCy, and scikit-learn, as well as Apache OpenNLP for Java.
Once the text data is loaded and prepared, it's time to apply the chosen tokenization technique. The specific steps may vary depending on the library or tool we're using, but the general process involves calling the tokenization function or method provided by the library and passing the text data as input.
```python
# Sample text
text = "Tokenization is important for NLP. It helps in breaking down text into individual units."

# Word tokenization
word_tokens = text.split()
print("Word tokens:")
print(word_tokens)

# Sentence tokenization
sentence_tokens = text.split(". ")
print("\nSentence tokens:")
print(sentence_tokens)
```
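For reference, running this split-based version should print something like the following. Note that plain whitespace splitting keeps punctuation attached to the words (e.g., 'NLP.'), and splitting on ". " drops the period from the first sentence, which is one reason dedicated tokenizers are usually preferred:

```python
# Expected output (approximate):
# Word tokens:
# ['Tokenization', 'is', 'important', 'for', 'NLP.', 'It', 'helps', 'in', 'breaking', 'down', 'text', 'into', 'individual', 'units.']
#
# Sentence tokens:
# ['Tokenization is important for NLP', 'It helps in breaking down text into individual units.']
```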
```python
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize

# Sample text
text = "Tokenization is important for NLP. It helps in breaking down text into individual units."

# Word tokenization using NLTK
# Tokenize the text into individual words
word_tokens = word_tokenize(text)
print("Word tokens:")
print(word_tokens)

# Sentence tokenization using NLTK
# Tokenize the text into individual sentences
sentence_tokens = sent_tokenize(text)
print("\nSentence tokens:")
print(sentence_tokens)
```
Lines 1–3: We import the NLTK library and download the punkt tokenizer models that word_tokenize() and sent_tokenize() rely on.
Lines 5–12: We define the sample text, then use the word_tokenize() function from NLTK to tokenize the text into individual words and print the result.
Lines 14–18: The sent_tokenize() function from NLTK is used to tokenize the text into individual sentences, and the result is printed.
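Assuming the punkt download succeeds, the output should look roughly like this. Unlike plain whitespace splitting, word_tokenize() separates punctuation into its own tokens:

```python
# Expected output (approximate):
# Word tokens:
# ['Tokenization', 'is', 'important', 'for', 'NLP', '.', 'It', 'helps', 'in', 'breaking', 'down', 'text', 'into', 'individual', 'units', '.']
#
# Sentence tokens:
# ['Tokenization is important for NLP.', 'It helps in breaking down text into individual units.']
```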
```python
import re

# Sample text
text = "Tokenization is important for NLP. It helps in breaking down text into individual units."

# Word tokenization using regular expressions
# Pattern explanation: \b\w+\b matches one or more word characters surrounded by word boundaries
word_tokens = re.findall(r'\b\w+\b', text)
print("Word tokens:")
print(word_tokens)

# Sentence tokenization using regular expressions
# Pattern explanation: (?<=\w\.)\s matches a whitespace character preceded by a word character followed by a period
sentence_tokens = re.split(r'(?<=\w\.)\s', text)
print("\nSentence tokens:")
print(sentence_tokens)
```
Line 1: We import re, Python's regular expression module.
Lines 3–4: We define the sample text to be tokenized.
Lines 6–10: The re.findall() function searches the given text using the regular expression pattern \b\w+\b. This pattern matches one or more word characters surrounded by word boundaries, and the matching word tokens are printed.
Lines 12–16: This part performs sentence tokenization using regular expressions. The re.split() function splits the given text using the regular expression pattern (?<=\w\.)\s. This pattern matches a whitespace character preceded by a word character followed by a period. The resulting sentence tokens are stored in the sentence_tokens variable and printed.
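One caveat worth noting: hand-written regular expressions are brittle compared to a trained sentence tokenizer. As a quick illustrative check (using a made-up sentence), the same split pattern also breaks after abbreviations, whereas a trained tokenizer such as NLTK's punkt typically handles common abbreviations better:

```python
import re

# The lookbehind pattern cannot tell an abbreviation from a sentence-ending period,
# so "Dr." incorrectly triggers a sentence split here.
tricky = "Dr. Smith studies NLP. Tokenization matters."
print(re.split(r'(?<=\w\.)\s', tricky))
# ['Dr.', 'Smith studies NLP.', 'Tokenization matters.']
```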
As explained in this Answer, there are various ways to tokenize text in NLP. Here's a summary of the word tokenization utilities offered by popular Python libraries:
| Library | Word Tokenization Method |
|---------|--------------------------|
| NLTK | nltk.word_tokenize |
| spaCy | nlp.tokenizer |
| Gensim | gensim.utils.tokenize |
| Keras | keras.preprocessing.text.Tokenizer |
| scikit-learn | CountVectorizer().build_tokenizer() |
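As a quick illustration of one of the rows above, here's a minimal spaCy sketch. It assumes spaCy is installed; spacy.blank("en") creates a bare English pipeline whose rule-based tokenizer runs when text is passed through it, so no trained model download is required:

```python
import spacy

# Create a blank English pipeline; it still includes the rule-based tokenizer.
nlp = spacy.blank("en")

text = "Tokenization is important for NLP. It helps in breaking down text into individual units."

# Passing the text through the pipeline invokes nlp.tokenizer under the hood.
doc = nlp(text)
word_tokens = [token.text for token in doc]
print(word_tokens)
```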