Outliers

Learn about outliers, reasons for handling them, and how to find them.

Defining outliers

An outlier is a record that deviates significantly from the rest of the data in a dataset. Such records can result from mistakes in data entry or measurement, or they can be legitimate but unusual observations. Because outliers can significantly affect the results of data analysis, it’s common practice to identify and examine them before any further analysis is done.

An outlier in a dataset

Examples of outliers in a dataset include:

  • In a dataset of patient records, an outlier may be a patient with a much shorter or longer hospital stay than other patients.

  • In a dataset of student grades, an outlier may be a student with a much lower or higher grade than the rest of the class.

Why deal with outliers

Here are reasons why it's crucial to deal with outliers in a dataset:

  • Outliers can distort summary statistics, including measures of central tendency (especially the mean; the median is much less sensitive) and variability (e.g., standard deviation). As an example, consider the heights of the students in a class. If the dataset includes a student much taller or shorter than the rest of the class, that single value can noticeably shift the mean height of the class.

  • Outliers can affect the fit of statistical models, such as linear regression. For example, if a model is trying to predict house ...
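
The effect on summary statistics described in the student heights example can be illustrated with a minimal sketch. The height values below are made up for illustration, and the helper name `summarize` is hypothetical; only Python's standard `statistics` module is assumed.

```python
import statistics

# Hypothetical heights in centimetres; the values are invented for illustration.
heights = [150, 152, 155, 157, 160]
outlier = 210  # one student far taller than the rest of the class


def summarize(values):
    """Return the mean, median, and sample standard deviation of values."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


without_outlier = summarize(heights)
with_outlier = summarize(heights + [outlier])

# Print each statistic before and after the outlier is added.
for name in ("mean", "median", "stdev"):
    print(f"{name}: {without_outlier[name]:.1f} -> {with_outlier[name]:.1f}")
```

With these made-up values, the mean moves from 154.8 to 164.0 and the standard deviation grows several times over, while the median only shifts from 155 to 156. This is one reason the median is often reported alongside the mean when outliers are suspected.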