Getting to Know the Dataset

Let's have a look at the dataset we'll be using and try to understand it.

Previously, we worked with well-known real-world datasets for text classification and entity extraction. Exploring the dataset is always our very first task. The main goal of data exploration is to understand the nature of the text so that we can design strategies in our algorithms to handle it. As we learned earlier, these are the main points to keep an eye on during exploration:

  • What kind of utterances are there? Are they short texts, full sentences, long paragraphs, or whole documents? What is the average utterance length?

  • What sort of entities does the corpus include? Person names, organization names, geographical locations, street names? Which ones do we want to extract?

  • How is punctuation used? Is the text correctly punctuated, or is no punctuation used at all?

  • How closely does the text follow grammatical rules? Is capitalization correct? Are there misspelled words?
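
Some of the questions above can be answered with a few lines of code before any modeling starts. The following is a minimal sketch, not the course's own exploration code: the function name `explore_utterances` and the sample utterances are hypothetical, and whitespace splitting stands in for a real tokenizer, so the length figures are only approximate.

```python
from statistics import mean


def explore_utterances(utterances):
    """Return simple surface statistics for a list of utterance strings."""
    # Approximate token counts by splitting on whitespace (an assumption;
    # a proper tokenizer would separate punctuation from words).
    token_counts = [len(u.split()) for u in utterances]

    # Does the utterance end with sentence-final punctuation?
    punctuated = [u.rstrip().endswith((".", "?", "!")) for u in utterances]

    # Does the utterance start with an uppercase letter?
    capitalized = [u.lstrip()[:1].isupper() for u in utterances]

    return {
        "num_utterances": len(utterances),
        "avg_tokens": mean(token_counts),
        "min_tokens": min(token_counts),
        "max_tokens": max(token_counts),
        "share_ending_with_punctuation": sum(punctuated) / len(utterances),
        "share_capitalized": sum(capitalized) / len(utterances),
    }


# Hypothetical sample utterances, made up for illustration only.
sample = [
    "I want to book a flight to Munich.",
    "show me cheap restaurants",
    "Is there a train to Berlin tomorrow?",
    "ok thx",
]

stats = explore_utterances(sample)
for name, value in stats.items():
    print(f"{name}: {value}")
```

A low share of punctuated or capitalized utterances suggests that the text is informal, which matters for any component that relies on casing or sentence boundaries. Entity types and misspellings are not covered by this sketch and still need to be inspected separately.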

The previous datasets we used consisted of (text, class_label) pairs to be used in text classification tasks or (text, ...