How to deal with contractions in NLP

Contractions in NLP

Contractions are combinations of words that are shortened by dropping letters and replacing them with apostrophes, such as "I'm" for "I am". In Natural Language Processing (NLP), it's vital to preprocess the text into a form that suits our task. In this answer, we'll learn to expand contractions in NLP.

Why is it essential to deal with contractions?

There are two main reasons why we should deal with contractions in NLP:

  • A computer doesn't recognize that contractions are abbreviations for a combination of words. Hence, it treats "I'm" and "I am" as different terms with different meanings.

  • Contractions increase the dimensionality of the document-term matrix, a matrix that describes the frequency of the terms that occur in a collection of documents. For instance, we'll have a column for the term "I'm" in addition to the columns for "I" and "am".
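To make the dimensionality point concrete, here is a minimal sketch that builds a tiny document-term matrix with the standard library only. The helper document_term_matrix and the naive whitespace tokenization are assumptions made for illustration; a real pipeline would normally use a proper tokenizer and vectorizer.

```python
from collections import Counter


def document_term_matrix(docs):
    """Return (vocabulary, rows) using naive, lowercased whitespace tokens."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    rows = [[count[term] for term in vocabulary] for count in counts]
    return vocabulary, rows


raw_docs = ["I'm fine", "I am fine"]
expanded_docs = ["I am fine", "I am fine"]  # contractions expanded by hand

raw_vocab, raw_rows = document_term_matrix(raw_docs)
exp_vocab, exp_rows = document_term_matrix(expanded_docs)

print(raw_vocab, raw_rows)  # four columns, and the two rows differ
print(exp_vocab, exp_rows)  # three columns, and the two rows are identical
```

With the contraction left in place, the two sentences get different rows and the matrix has an extra column. After expansion, both sentences map to the same row.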

How to deal with contractions

We can use Python's contractions library to expand contractions. It can be installed using the following command:

pip install contractions

The following code snippet demonstrates how to expand the contractions:

import contractions

text = '''Hello mom! Yes, I'm fine. How're you? No, I didn't have lunch. I'm about to go.
Are you coming next weekend? I've been missing you.'''

expanded_text = []
for word in text.split():
    expanded_text.append(contractions.fix(word))

expanded_text = ' '.join(expanded_text)
print('Input : ' + text)
print('\n')
print('Output: ' + expanded_text)

Explanation

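  • Lines 7–8: We loop over the whitespace-separated words, expand each one with contractions.fix(), and append the result to the expanded_text list.

  • Line 10: We join the expanded words with a space (' ') to build the final expanded_text string. Note that this also replaces the line break in the original text with a space.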
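To see what this kind of expansion amounts to, the following is a minimal sketch of a dictionary-based expander built on the standard re module. It is not the contractions library's actual implementation. CONTRACTION_MAP and expand are hypothetical names, and the mapping is deliberately tiny. The sketch only handles straight apostrophes (').

```python
import re

# Hypothetical, deliberately tiny mapping; a real library ships a far larger one.
CONTRACTION_MAP = {
    "i'm": "I am",
    "i've": "I have",
    "didn't": "did not",
    "how're": "how are",
}

# One alternation pattern over all keys, longest first, matched case-insensitively.
PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(key) for key in sorted(CONTRACTION_MAP, key=len, reverse=True))
    + r")\b",
    flags=re.IGNORECASE,
)


def expand(text):
    """Replace every known contraction in text with its expansion."""

    def replace(match):
        original = match.group(0)
        expansion = CONTRACTION_MAP[original.lower()]
        # Preserve a leading capital letter, e.g. "How're" -> "How are".
        if original[0].isupper():
            expansion = expansion[0].upper() + expansion[1:]
        return expansion

    return PATTERN.sub(replace, text)


sample = "Yes, I'm fine. How're you?\nNo, I didn't have lunch."
print(expand(sample))
```

Because the substitution runs on the whole string instead of on split words, punctuation and line breaks in the input are left untouched.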

Ambiguity of contractions

Expanding contractions with the contractions library is straightforward. However, on closer inspection, some contractions can stand for several different word combinations. Consider the following example:

"ain't": "am not / are not / is not / has not / have not"

The contractions library doesn't handle this ambiguity. For the example above, the package always expands to "are not", even when the subject of the sentence calls for a different form.

This is demonstrated in the code below:

import contractions

text = '''I ain't doing that.'''

expanded_text = []
for word in text.split():
    expanded_text.append(contractions.fix(word))

expanded_text = ' '.join(expanded_text)
print('Input : ' + text)
print('\n')
print('Output: ' + expanded_text)

The pycontractions library

We can also use the pycontractions library to expand contractions. It works as follows:

  • Case 1: If a contraction corresponds to only one sequence of words, pycontractions replaces the contraction with that word sequence.

  • Case 2: If a contraction corresponds to several possible expansions, pycontractions produces all the possible expansions and then runs a grammar checker over them. The grammatically incorrect options are discarded, and the most suitable remaining option is selected.
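The "generate every expansion, then discard the ungrammatical ones" idea in Case 2 can be sketched with a toy example. This is not the pycontractions API. AMBIGUOUS, AGREEMENT, count_errors, and expand_ambiguous are hypothetical names, and the two hand-written rules are only a stand-in for a real grammar checker.

```python
# Candidate expansions for an ambiguous contraction (the "ain't" example above).
AMBIGUOUS = {
    "ain't": ["am not", "are not", "is not", "has not", "have not"],
}

# Toy stand-in for a grammar checker: auxiliaries that agree with each subject.
AGREEMENT = {
    "i": {"am", "have"},
    "you": {"are", "have"},
    "we": {"are", "have"},
    "they": {"are", "have"},
    "he": {"is", "has"},
    "she": {"is", "has"},
    "it": {"is", "has"},
}

HAVE_FORMS = {"have", "has"}


def count_errors(subject, auxiliary, following):
    """Count toy grammar errors for one candidate expansion."""
    errors = 0
    if auxiliary not in AGREEMENT.get(subject.lower(), set()):
        errors += 1  # subject-verb disagreement, e.g. "I are"
    if auxiliary in HAVE_FORMS and following.strip(".,!?").endswith("ing"):
        errors += 1  # e.g. "have not doing"
    return errors


def expand_ambiguous(sentence):
    """Expand ambiguous contractions by keeping the candidate with the fewest errors."""
    tokens = sentence.split()
    for i, token in enumerate(tokens):
        # Only handle contractions that have both a preceding and a following word.
        if token.lower() in AMBIGUOUS and 0 < i < len(tokens) - 1:
            subject, following = tokens[i - 1], tokens[i + 1]
            tokens[i] = min(
                AMBIGUOUS[token.lower()],
                key=lambda exp: count_errors(subject, exp.split()[0], following),
            )
    return " ".join(tokens)


print(expand_ambiguous("I ain't doing that."))
print(expand_ambiguous("She ain't listening."))
```

The rules here are far too crude for real text (ties are simply broken by list order), but they show why looking at the surrounding words lets an expander pick "am not" for "I" and "is not" for "She", which a fixed one-to-one mapping cannot do.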

Because it takes the grammar of the surrounding text into account, pycontractions is generally more accurate than Python's contractions library when a contraction is ambiguous.

Copyright ©2024 Educative, Inc. All rights reserved