Visualizing Outliers
Interact with sample code to understand how to visualize outliers.
Outliers are data points that differ notably from the main body of samples in a dataset, and they appear in many real-world datasets. The plot below shows an example: the outliers are the data points in the 1000–1055 range, which lie far from the other data points in the 0–500 range.
import numpy as np
from matplotlib import pyplot as plt

# Seed NumPy's random number generator for reproducibility
np.random.seed(42)

# Generate 100 random values, then append four outliers
data = np.random.uniform(0, 500, 100)
data = np.append(data, [1000, 1025, 1030, 1055])

# Plot a histogram of the data and save it to a file
plt.hist(data, bins=5)
plt.title('Random Data')
plt.xlabel('Sample Variable')
plt.ylabel('Frequency')
plt.savefig('output/to.png')
plt.close()
Identifying the context around outliers can help add interesting insights to narratives and help data scientists make decisions about how to handle outliers.
Let's explore three steps for handling outliers in data storytelling:
Identifying and visualizing outliers
Identifying trends and relationships of outliers and other data points
Resolving or keeping outliers
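The first step, identifying outliers, can also be sketched numerically on the synthetic sample from the histogram example above. This is a minimal sketch rather than the lesson's own method: it assumes the common 1.5 × IQR rule of thumb as the definition of "notably different", and the names `lower_fence`, `upper_fence`, and `is_outlier` are illustrative.

```python
import numpy as np

# Rebuild the synthetic sample used for the histogram example.
np.random.seed(42)
data = np.random.uniform(0, 500, 100)
data = np.append(data, [1000, 1025, 1030, 1055])

# Assumption: the 1.5 * IQR rule of thumb defines the outlier fences.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Boolean mask marking the points that fall outside the fences.
is_outlier = (data < lower_fence) | (data > upper_fence)
outliers = np.sort(data[is_outlier])

print(f"Fences: [{lower_fence:.1f}, {upper_fence:.1f}]")
print(f"Flagged {outliers.size} of {data.size} points: {outliers}")
```

The fences depend on the chosen multiplier, so a different threshold or a different rule (for example, a z-score cutoff) may flag a different set of points. Whether a flagged point should be kept or resolved still depends on the context of the data.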
Context of the data
We will be looking at the Tips dataset, composed of information one waiter collected about tips they received working in a restaurant over a few months.
import plotly

# Import the tips dataset
tips_data = plotly.data.tips()

# Print the feature names and head of the dataframe
print(tips_data.columns.tolist())
print(tips_data.head(10))
The variables in the dataset include:
total_bill: The total bill in dollars
tip: The total tip in dollars ...