Categories from Unstructured Texts
Explore how to extract structured features from unstructured texts with pattern matching and machine learning.
Note: The Abt-Buy dataset we use below is open data. See the Glossary of this course for attribution and references.
Structured data is comfortable. If we have names, addresses, birth dates, tax IDs, prices, etc., in separate attributes, we can compare each attribute separately before drawing a final match vs. no-match conclusion. But what can we do with unstructured data, which is often all we get in product resolution scenarios? Let’s have a look at the Abt-Buy dataset.
import pandas as pd

# Product records from two e-commerce shops:
abt = pd.read_csv('abt_buy/abt.csv', encoding='iso-8859-1')
buy = pd.read_csv('abt_buy/buy.csv', encoding='iso-8859-1')

# Print three randomly sampled records from the first shop:
for _, row in abt.sample(3, random_state=1).iterrows():
    print(row)
    print('---')
The name and description attributes are unstructured: free texts without a consistent format. Humans recognize bits of information, such as the manufacturer, product codes, and other technical details. Can we also do this systematically with code?
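To give a first impression of what pattern matching on free text can look like, here is a minimal sketch that pulls a product-code-like token out of a name. The sample names are made up for illustration and are not taken from the Abt-Buy data. The pattern itself is an assumption: it treats any token of at least four characters that mixes uppercase letters and digits (hyphens allowed) as a product code. Real product codes may well need a different rule.

```python
import re

# Hypothetical product names, made up for illustration (not from Abt-Buy).
names = [
    "Acme Wireless Headphones - WH1000B",
    "Examplar 40-inch LCD TV (EX-40A550)",
    "Generic USB cable",
]

# Assumption: a product code is a token of 4+ characters drawn from
# uppercase letters, digits, and hyphens, containing at least one
# letter and at least one digit.
CODE_PATTERN = re.compile(
    r"\b(?=[A-Z0-9-]*[A-Z])(?=[A-Z0-9-]*\d)[A-Z0-9][A-Z0-9-]{3,}\b"
)


def extract_code(name):
    """Return the first product-code-like token in name, or None."""
    match = CODE_PATTERN.search(name)
    return match.group(0) if match else None


codes = [extract_code(name) for name in names]
print(codes)
```

Names without a matching token yield None, so we can see at a glance how many records the rule fails to cover before relying on the extracted attribute.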
Extract manufacturers
First, we isolate the manufacturer into a separate attribute. In this case, we are lucky—practically ...