Text Normalization

Learn how to perform numeric digit normalization and handle contractions using Python.

Numeric digit normalization

In text data, the same number can appear in different formats, which complicates analysis and modeling. For example, “two pizzas” and “2 pizzas” refer to the same quantity but are different strings. Numeric digit normalization converts these varying representations into one standardized format so that algorithms treat them as equivalent. The result is a consistent representation of numbers that makes the data easier to analyze.

Common approaches for performing numeric digit normalization include:

  • Converting words to digits: This approach converts numeric words to their corresponding digits. For example, “five” becomes “5”. Numeric words are then consistently represented as digits, which can be parsed for calculations and comparisons.

  • Converting digits to words: This technique converts numeric digits to words, which can enhance text readability. For instance, “10” becomes “ten”.

  • Removing numeric separators: Numbers may contain commas, spaces, or other symbols as digit-group separators. Removing them makes the representation uniform. For example, “1,000” and “1000” both normalize to “1000”.
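The separator-removal approach can be sketched with only the standard library. This is a minimal illustration, not a complete solution: the function name `remove_numeric_separators` and its `separators` parameter are hypothetical, and the sketch assumes a separator always sits between a digit and a group of exactly three digits, so a decimal comma such as “3,14” is left alone.

```python
import re


def remove_numeric_separators(text: str, separators: str = ",") -> str:
    """Strip thousands separators that sit between digit groups.

    Assumption: a separator is any character from `separators` that is
    preceded by a digit and followed by exactly three digits.
    """
    # Build a character class from the allowed separator characters.
    sep_class = "[" + re.escape(separators) + "]"
    # Lookbehind: a digit before the separator.
    # Lookahead: exactly three digits after it (not four or more).
    pattern = rf"(?<=\d){sep_class}(?=\d{{3}}(?!\d))"
    return re.sub(pattern, "", text)


if __name__ == "__main__":
    print(remove_numeric_separators("We sold 1,000 pizzas and 1000 drinks."))
    print(remove_numeric_separators("Population: 1 000 000", separators=", "))
```

Passing `separators=", "` also treats spaces as separators, which is only safe when the text does not place unrelated numbers next to each other.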

Let’s use the word2number library to convert words to digits, the inflect library to convert digits to words, and Python’s built-in re (regular expressions) module to remove numeric separators.
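A minimal sketch of the two library-based conversions might look like the following. It assumes both third-party packages are installed (`pip install word2number inflect`). The helper names `words_to_digits` and `digits_to_words` are hypothetical wrappers chosen for illustration; the library calls they rely on are `w2n.word_to_num` and `inflect.engine().number_to_words`.

```python
import re


def words_to_digits(phrase: str) -> int:
    """Convert a number phrase such as "twenty one" to an int."""
    # Third-party package (pip install word2number). Imported here so the
    # module still loads when the package is not installed.
    from word2number import w2n

    return w2n.word_to_num(phrase)


def digits_to_words(text: str) -> str:
    """Replace each run of digits in `text` with its spelled-out form."""
    # Third-party package (pip install inflect).
    import inflect

    engine = inflect.engine()
    # Convert every match to an int, then to words, leaving other text as-is.
    return re.sub(
        r"\d+",
        lambda match: engine.number_to_words(int(match.group())),
        text,
    )
```

With both packages available, `words_to_digits("five")` should return the integer 5, and `digits_to_words("10 pizzas")` should return “ten pizzas”. Note that `words_to_digits` expects a number phrase rather than a full sentence, so locating number words inside running text is left out of this sketch.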
