Visualizing Outliers

Interact with sample code to understand how to visualize outliers.

Outliers are data points that differ notably from the main body of samples in a dataset, and they appear in many real-world datasets. The plot below shows an example: the outliers are the data points above 1000, which sit far from the other data points in the 0–500 range.

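import numpy as np
from matplotlib import pyplot as plt
# Seed NumPy's random number generator so the results are reproducible
np.random.seed(42)
# Generate 100 uniform random values in [0, 500), then append four outliers
data = np.random.uniform(0, 500, 100)
data = np.append(data, [1000, 1025, 1030, 1055])
# Create the figure explicitly so it can be closed after saving
fig = plt.figure()
plt.hist(data, bins=5)
plt.title('Random Data')
plt.xlabel('Sample Variable')
plt.ylabel('Frequency')
# Save the histogram to disk, then release the figure
plt.savefig('output/to.png')
plt.close(fig)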

Understanding the context around outliers can add interesting insights to a narrative and help data scientists decide how to handle them.

Let's explore three steps for handling outliers in data storytelling:

  1. Identifying and visualizing outliers

  2. Identifying trends and relationships of outliers and other data points

  3. Resolving or keeping outliers
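As a rough illustration of the first step, the sketch below flags outliers in the synthetic data from the histogram example. It assumes the common 1.5 × IQR rule of thumb, which this lesson does not prescribe; the helper name `iqr_outlier_mask` and its `k` parameter are hypothetical and chosen only for this example.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside the IQR fences.

    Hypothetical helper: the k = 1.5 multiplier is a common rule of
    thumb, assumed here for illustration only.
    """
    values = np.asarray(values, dtype=float)
    # First and third quartiles of the data
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    # Values beyond these fences are treated as outliers
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Rebuild the synthetic data used in the histogram example
np.random.seed(42)
data = np.random.uniform(0, 500, 100)
data = np.append(data, [1000, 1025, 1030, 1055])

# Flag the outliers and print them
mask = iqr_outlier_mask(data)
outliers = data[mask]
print(f"Flagged {mask.sum()} of {data.size} points: {np.sort(outliers)}")
```

A boolean mask keeps the flagged points aligned with the original array, so the same mask can later be used to highlight them in a plot or to drop or keep them in step 3.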

Context of the data

We will look at the Tips dataset, which contains information one waiter collected about the tips they received while working in a restaurant over a few months.

import plotly
#Import the tips dataset
tips_data = plotly.data.tips()
#Print the feature names and head of the dataframe
print(tips_data.columns.tolist())
print(tips_data.head(10))

The variables in the dataset include:

  • total_bill: The total bill in dollars

  • tip: The total tip in dollars ...