
Finding Outliers in Data

Understand what outliers are and why they occur in datasets. Explore methods to detect outliers, especially using the Interquartile Range (IQR) method and visualization. Learn how to manage outliers by adjusting values to fit within defined bounds to improve data quality for predictive analysis.

What is an outlier?

An outlier is a data point that lies far from the bulk of the values in a dataset. Suppose a list has these elements: [32,30,39,35,31,4,37]. It is evident that 4 is the outlier in this list because all the other elements fall between 30 and 39, with a mean of 34. More generally, any data point that behaves very differently from the rest of the set is considered an outlier.
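To make the example concrete, here is a minimal sketch that flags 4 in the list above using the IQR rule mentioned in the lesson overview. The 1.5 multiplier is the conventional choice and is an assumption here, as is the use of Python's standard-library `statistics.quantiles` with its default quartile method. Other tools compute quartiles slightly differently, so the exact bounds may vary.

```python
from statistics import mean, quantiles

data = [32, 30, 39, 35, 31, 4, 37]

# Quartiles from the standard library's default ("exclusive") method.
# NumPy or pandas interpolate differently and may give other cut points.
q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1

# Assumption: the conventional 1.5 * IQR fences.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Split the values into those outside and inside the fences.
outliers = [x for x in data if x < lower or x > upper]
inliers = [x for x in data if lower <= x <= upper]

print("bounds:", lower, upper)
print("outliers:", outliers)
print("mean of remaining values:", mean(inliers))
```

With this data, only 4 falls outside the computed bounds, which matches the intuition from reading the list by eye.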

Why do outliers exist?

An outlier in a dataset usually exists for one of the following two reasons:

  1. Variance in data: Real data varies naturally, so some genuine observations can be anomalous and fall far from where most values are concentrated.

  2. Entry error: This occurs mainly due to human error while preparing the dataset or entering values.

Identifying outliers

There are two main methods used to identify outliers in any dataset:

  1. Visualization plots: The outliers are clearly visible if we plot the data in a scatter, box, or histogram plot, as they are away from the center of the data. More about this will be
...