Text Data Types

Explore the general concepts behind handling and manipulating text data types in pandas.

Importance of working with text

In today’s increasingly online world, we’re generating data at an unprecedented rate. Text is one type of data that is growing rapidly in both volume and importance.

From social media posts to online reviews, news articles, and product descriptions, text data is ubiquitous and offers insights into consumer behavior, sentiment, and preferences. Therefore, working with text data is a crucial skill for data practitioners.

Handling text data, however, can be challenging due to its unstructured nature. Unlike structured data, which is typically stored in tables with predefined columns, text data can take many forms and contain a wide range of information.

Fortunately, pandas comes with numerous capabilities that can help us work with text data effectively.

Text data types

There are two ways to store text data in pandas:

  • The object data type (a NumPy array of Python objects)

  • The StringDtype extension type

Note: The terms dtype and data type refer to the same concept because dtype is short for data type. They are used to describe the type of data that is stored in a DataFrame or Series. To keep things clear and standardized, we’ll use data type when describing general concepts and the keyword dtype when referring to pandas code.

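It’s generally recommended to store text data using StringDtype because it states explicitly that a column holds only text, whereas an object column can hold any mix of Python objects. Even so, in pandas 1.x and 2.x the object data type remains the default when pandas infers the type of a list of strings, for backward compatibility with older pandas versions.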
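As a minimal sketch of the two storage options, the snippet below builds the same Series twice and requests each data type explicitly through the dtype argument. The sample values are hypothetical placeholders, not the lesson's dataset, and the dtype is passed explicitly because the inferred default may differ between pandas versions.

```python
import pandas as pd

# Hypothetical sample values used only for illustration.
names = ["alpha", "beta", "gamma"]

# Request each storage option explicitly instead of relying on inference.
as_object = pd.Series(names, dtype="object")
as_string = pd.Series(names, dtype="string")

print(as_object.dtype)
print(as_string.dtype)

# An existing object Series can be converted with astype.
converted = as_object.astype("string")
print(isinstance(converted.dtype, pd.StringDtype))
```

Passing dtype="string" is shorthand for pd.StringDtype(), so either spelling should give the same result here.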

Suppose we have the following mock dataset of three webcam products and their corresponding retail prices:
