Categories from Unstructured Texts

Explore how to preprocess unstructured text in product data by extracting key information such as manufacturers, product codes, and technical details. Learn to segment products into meaningful categories using classification and text vectorization to improve entity resolution accuracy in Python data workflows.

Note: The Abt-Buy dataset we use below is open data. See the Glossary of this course for attribution and references.

Structured data is comfortable. If names, addresses, birth dates, tax IDs, prices, and so on sit in separate attributes, we can compare each attribute on its own before drawing a final match vs. no-match conclusion. But what can we do with unstructured data, as is often the case in product resolution scenarios? Let’s have a look at the Abt-Buy dataset.

Python
import pandas as pd
# Product records from two e-commerce shops:
abt = pd.read_csv('abt_buy/abt.csv', encoding='iso-8859-1')
buy = pd.read_csv('abt_buy/buy.csv', encoding='iso-8859-1')
# Print three randomly sampled Abt records (fixed seed for reproducibility):
for _, row in abt.sample(3, random_state=1).iterrows():
    print(row)
    print('---')

The name and description attributes are unstructured—free texts without a consistent format. Humans recognize bits of information, such as the manufacturer, product codes, and other technical details. Can we also do this systematically with code?
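As a first taste of what "systematically" could look like, here is a minimal sketch that pulls product-code candidates out of free text with a regular expression. The sample names are made up for illustration and are not taken from Abt-Buy. The pattern rests on an assumption that may not hold for every shop: a product code is a token of at least four characters that mixes uppercase letters and digits, optionally with hyphens. The helper name `extract_code_candidates` is hypothetical.

```python
import re

# Hypothetical product names, invented for illustration (not Abt-Buy records).
names = [
    "Acme Wireless Headphones - WH1000X",
    "Globex 42' LCD TV Black GX-42B7",
]

# Assumption: a product code is a token of 4+ characters drawn from uppercase
# letters, digits, and hyphens that contains at least one letter and one digit.
CODE_PATTERN = re.compile(
    r"\b(?=[A-Z0-9-]*[A-Z])(?=[A-Z0-9-]*\d)[A-Z0-9-]{4,}\b"
)

def extract_code_candidates(text):
    """Return all tokens in `text` that look like product codes."""
    return CODE_PATTERN.findall(text)

for name in names:
    print(name, "->", extract_code_candidates(name))
```

A heuristic like this yields candidates rather than certainties: plain words and pure numbers are skipped, but an all-caps token that happens to contain a digit would also be picked up, so the results should be spot-checked against real records.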

Extract manufacturers

First, we isolate the manufacturer into a separate attribute. In this case, we are lucky—practically ...