Categories from Unstructured Texts
Explore how to preprocess unstructured text in product data by extracting key information such as manufacturers, product codes, and technical details. Learn to segment products into meaningful categories using classification and text vectorization to improve entity resolution accuracy in Python data workflows.
Note: The Abt-Buy dataset we use below is open data. See the Glossary of this course for attribution and references.
Structured data is comfortable. If we have names, addresses, birth dates, tax IDs, prices, etc., in separate attributes, we can compare each attribute separately before drawing a final match vs. no-match conclusion. But what can we do with unstructured data, which is often what we get in product resolution scenarios? Let’s have a look at the Abt-Buy dataset.
The name and description attributes are unstructured: free text without a consistent format. Humans readily recognize bits of information in them, such as the manufacturer, product codes, and other technical details. Can we also do this systematically with code?
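To make the question concrete, here is a minimal sketch of pulling product-code candidates out of free-text names. It rests on assumptions that are ours, not the dataset's: the sample names are made up for illustration, `extract_codes` is a hypothetical helper, and the rule "an all-uppercase token of at least five characters containing both a letter and a digit" is only a heuristic that would need tuning against real data.

```python
import re

# Made-up product names in a catalog-like style (illustrative only,
# not actual rows from the Abt-Buy dataset).
names = [
    "Sony Turntable - PSLX350H",
    "Panasonic 2-Line Cordless Phone KX-TG6702B",
    "Bose Acoustimass 5 Series III Speaker System",
]

# A token is a run of letters/digits, optionally joined by single hyphens.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def extract_codes(text, min_length=5):
    """Return tokens that look like product codes (heuristic, see lead-in)."""
    codes = []
    for token in TOKEN.findall(text):
        has_letter = any(c.isalpha() for c in token)
        has_digit = any(c.isdigit() for c in token)
        # Assumption: codes mix letters and digits and contain no lowercase.
        if has_letter and has_digit and token.isupper() and len(token) >= min_length:
            codes.append(token)
    return codes


for name in names:
    print(name, "->", extract_codes(name))
```

The heuristic deliberately ignores tokens such as "2-Line" (contains lowercase letters) and "III" (no digit). A stricter or looser rule may suit other catalogs, so it is worth checking the extracted candidates against a sample of real records before relying on them.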
Extract manufacturers
First, we isolate the manufacturer into a separate attribute. In this case, we are lucky—practically ...