Bokeh is a Python library used for creating interactive visualizations in a web browser. It provides powerful tools that offer flexibility, interactivity, and scalability for exploring various data insights.
Box plots summarize a dataset visually using a small set of statistical measures. These measures describe the spread and central tendency of the data and give a quick picture of its distribution:
Upper extreme: The maximum value in the dataset, i.e., the top of the data range.
Upper quartile: The third quartile, i.e., the value below which 75% of the data falls.
Median: The middle value, which divides the dataset into two halves: 50% of the data lies above it and 50% lies below it.
Lower quartile: The first quartile, i.e., the value below which 25% of the data falls.
Lower extreme: The minimum value in the dataset, i.e., the bottom of the data range.
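As a rough illustration of the five measures above, the following sketch computes them with Python's standard statistics module. The function name five_number_summary and the sample values are hypothetical and are not part of the Bokeh example; the "inclusive" quantile method is one of several common conventions.

```python
import statistics


def five_number_summary(values):
    """Return (lower extreme, Q1, median, Q3, upper extreme)."""
    # method="inclusive" interpolates linearly between data points and
    # treats the minimum and maximum as the 0th and 100th percentiles.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


# Hypothetical sample of city MPG values, for illustration only.
sample = [9, 11, 14, 15, 17, 18, 21, 24, 35]
summary = five_number_summary(sample)
print(summary)
```

Other quantile conventions can give slightly different quartiles on small samples, so the exact numbers depend on the method chosen.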
Box plots are widely used in industry and research to analyze and compare results across many domains.
import pandas as pd
from bokeh.io import output_file, save
from bokeh.models import ColumnDataSource, Whisker
from bokeh.plotting import figure, show
from bokeh.sampledata.autompg2 import autompg2
from bokeh.transform import factor_cmap
pandas: To manipulate data.
bokeh.io: To control the output and display of the plots. We specifically import the output_file and save functions from it.
bokeh.models: To create highly customized visualizations in Bokeh. We specifically import the ColumnDataSource and Whisker classes from it.
bokeh.plotting: To create and customize plots without working directly with the lower-level Bokeh models. We specifically import the figure and show functions from it.
bokeh.sampledata: To access the sample datasets available for Bokeh and use them to test your code. autompg2 is one of these datasets; it contains information about various car models, including city and highway MPG, engine displacement, number of cylinders, and vehicle class.
bokeh.transform: To map data values to visual properties such as colors, sizes, and positions. We specifically import the factor_cmap function from it.
import pandas as pd
from bokeh.io import output_file, save
from bokeh.models import ColumnDataSource, Whisker
from bokeh.plotting import figure, show
from bokeh.sampledata.autompg2 import autompg2
from bokeh.transform import factor_cmap

dataFrame = autompg2[["class", "cty"]].rename(columns={"class": "kind"})

kinds = dataFrame.kind.unique()

# compute quartiles
quartilesDF = dataFrame.groupby("kind").cty.quantile([0.25, 0.5, 0.75])
quartilesDF = quartilesDF.unstack().reset_index()
quartilesDF.columns = ["kind", "q1", "q2", "q3"]
dataFrame = pd.merge(dataFrame, quartilesDF, on="kind", how="left")

# compute IQR outlier bounds
iqr = dataFrame.q3 - dataFrame.q1
dataFrame["upper"] = dataFrame.q3 + 1.5*iqr
dataFrame["lower"] = dataFrame.q1 - 1.5*iqr

source = ColumnDataSource(dataFrame)

# create plot
myPlot = figure(x_range=kinds, tools="", toolbar_location=None,
                title="City driving MPG distribution by vehicle class",
                background_fill_color="#bbbfbf", y_axis_label="Fuel efficiency")

# outlier range
whisker = Whisker(base="kind", upper="upper", lower="lower", source=source)
whisker.upper_head.size = whisker.lower_head.size = 20
myPlot.add_layout(whisker)

# color palette
cmap = factor_cmap("kind", "TolRainbow7", kinds)

# quartile boxes
myPlot.vbar("kind", 0.7, "q2", "q3", source=source, color=cmap, line_color="black")
myPlot.vbar("kind", 0.7, "q1", "q2", source=source, color=cmap, line_color="black")

# outliers
outliers = dataFrame[~dataFrame.cty.between(dataFrame.lower, dataFrame.upper)]
myPlot.scatter("kind", "cty", source=outliers, size=6, color="black", alpha=0.3)

output_file("output.html")
show(myPlot)
Lines 1–6: Import all the necessary libraries and modules.
Line 8: Select the class and cty columns from the autompg2 dataset to create a new dataFrame, and use rename() to rename the class column to kind. Renaming is not strictly necessary, but class is a reserved word in Python, so the column could not be accessed as dataFrame.class; kind can be used as an attribute throughout the code.
Line 10: Extract the unique values from the kind column and assign them to the kinds variable.
Line 13: Use groupby() to group the rows by the kind column and calculate the quartiles of the cty column for each group. The resulting pandas Series is assigned to quartilesDF.
Lines 14–15: Use unstack() to create a separate column for each quartile and assign a name to each column.
Line 16: Merge dataFrame and quartilesDF on the kind column using a left join.
Lines 19–21: Calculate the interquartile range, i.e., the difference between the 75th and 25th percentiles, and assign it to the iqr variable. Then save the upper and lower bounds in new dataFrame columns.
Note: We multiply iqr by 1.5 because, by a widely accepted convention, values that lie more than 1.5 times the interquartile range below the first quartile or above the third quartile are treated as outliers.
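To make the 1.5 × IQR convention concrete, here is a small worked sketch using only the standard library. The helper iqr_bounds and the sample values are hypothetical; the main example performs the same arithmetic column-wise with pandas.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds at Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical sample of city MPG values, for illustration only.
sample = [9, 11, 14, 15, 17, 18, 21, 24, 35]
lower, upper = iqr_bounds(sample)

# Any value outside [lower, upper] is flagged as an outlier.
outliers = [v for v in sample if not lower <= v <= upper]
print(lower, upper, outliers)
```

With this sample, Q1 is 14 and Q3 is 21, so the IQR is 7 and the bounds fall 10.5 units beyond each quartile.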
Line 23: Create a ColumnDataSource object from dataFrame so that the data can be provided to the plot.
Lines 26–28: Create myPlot using the figure() function and pass all the specifications as parameters. Set x_range to kinds and specify the title, y-axis label, and background color for the plot.
Line 31: Create a whisker object using Whisker() and pass the base, upper, and lower column names along with the source as parameters.
Line 32: Set the size of the whisker's upper_head and lower_head, which controls how large the caps at the ends of the whisker are drawn in the plot.
Line 33: Add the whisker to the myPlot figure using the add_layout() method.
Line 36: Use the factor_cmap() function to map each value in the kind column to a color from the TolRainbow7 palette and assign the mapping to cmap.
Lines 39–40: Create the quartile boxes on myPlot using the vbar() method and pass the column name, bar width, quartiles, source, and color mapping as parameters. Call it twice: first for the upper box (median to third quartile), then for the lower box (first quartile to median).
Line 43: Select the rows of dataFrame whose cty values do not lie between the lower and upper bounds and assign them to outliers.
Line 44: Plot the outliers as scattered points using scatter() and pass the columns, source, size, color, and transparency as parameters.
Lines 46–47: Use output_file() to set output.html as the file the plot is written to, and call show() to display the created plot.
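The quartile step in lines 13–16 is easier to follow on a tiny table. The sketch below repeats the same groupby(), quantile(), unstack(), and merge() calls on made-up data; the toy data frame and its values are assumptions for illustration and do not come from autompg2.

```python
import pandas as pd

# Hypothetical toy data standing in for autompg2; the values are made up.
toy = pd.DataFrame({
    "kind": ["compact", "compact", "compact", "suv", "suv", "suv"],
    "cty": [20, 22, 24, 12, 14, 16],
})

# One value per (kind, quantile) pair, returned as a Series
# with a two-level index.
quartiles = toy.groupby("kind").cty.quantile([0.25, 0.5, 0.75])

# Move the quantile level into columns, then turn "kind" back
# into a regular column.
quartiles = quartiles.unstack().reset_index()
quartiles.columns = ["kind", "q1", "q2", "q3"]

# Attach each class's quartiles to every row of that class.
merged = pd.merge(toy, quartiles, on="kind", how="left")

print(quartiles)
print(merged)
```

After the merge, every row carries the quartiles of its own class, which is what lets the later steps compute the bounds and filter outliers with plain column arithmetic.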
The box plot is written to output.html and shows boxes colored with the TolRainbow7 palette on a #bbbfbf background, with whiskers and labels as specified in the code.