Categories from Unstructured Texts

Explore how to extract structured features from unstructured texts with pattern matching and machine learning.

Note: The Abt-Buy dataset we use below is open data. See the Glossary of this course for attribution and references.

Structured data is comfortable. If we have names, addresses, birth dates, tax IDs, prices, etc., in separate attributes, we can compare each attribute separately before drawing a final match vs. no-match conclusion. But what can we do with unstructured data, which is what we often face in product resolution scenarios? Let’s have a look at the Abt-Buy dataset.

import pandas as pd

# Product records from two e-commerce shops:
abt = pd.read_csv('abt_buy/abt.csv', encoding='iso-8859-1')
buy = pd.read_csv('abt_buy/buy.csv', encoding='iso-8859-1')

# Print three randomly sampled records from the first shop:
for _, row in abt.sample(3, random_state=1).iterrows():
    print(row)
    print('---')

The name and description attributes are unstructured: free text without a consistent format. Humans recognize bits of information in them, such as the manufacturer, product codes, and other technical details. Can we also do this systematically with code?
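To get a feeling for what "systematically with code" could mean, here is a minimal pattern-matching sketch that pulls code-like tokens out of a free-text name. It rests on an assumption of ours, not on a rule stated by the dataset: a product code is an all-uppercase token of at least five characters that mixes letters and digits. The function name find_codes, the length threshold, and the example string are hypothetical and chosen only for illustration.

```python
import re

def find_codes(text, min_length=5):
    """Return tokens that look like product codes.

    Assumption (illustrative only): a code has at least `min_length`
    characters, contains both letters and digits, and has no lowercase
    letters.
    """
    # Split the text into runs of letters, digits, and hyphens.
    tokens = re.findall(r'[A-Za-z0-9-]+', text)
    return [
        token for token in tokens
        if len(token) >= min_length
        and any(char.isdigit() for char in token)
        and any(char.isalpha() for char in token)
        and token == token.upper()
    ]

# Made-up product name, not a record from the Abt-Buy dataset:
example = 'Acme 40-Inch Black LCD Television - XT4000B'
print(find_codes(example))  # ['XT4000B']
```

A rule this simple will miss codes that contain lowercase letters and may pick up tokens that are not codes at all, so it should be checked against real records before relying on it.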

Extract manufacturers

First, we isolate the manufacturer into a separate attribute. In this case, we are lucky—practically ...