Descriptive Statistics

Build an understanding of fundamental descriptive statistical terms.

Statistics is used in business, medical research, social sciences, manufacturing, and many more areas. We use statistics to analyze data and make informed decisions. It helps us take a general look at any given data and identify patterns, trends, and relationships, which we can then use to make predictions and identify opportunities for improvement.

Descriptive statistics is a branch of statistics that deals with the collection, analysis, interpretation, presentation, and organization of data. We use descriptive statistics methods to describe and summarize the main features of a dataset, using measures such as central tendency (mean, median, mode) and dispersion (range, variance, standard deviation). The aim is to provide a concise and meaningful overview of the data.
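As a quick illustration of these measures, the following sketch summarizes a small made-up vector with base R functions. The values in `x` are hypothetical and were chosen only so that the results are easy to check by hand.

```r
# Hypothetical sample values, chosen for easy arithmetic (not from the lesson)
x <- c(2, 4, 4, 5, 7, 9, 11)

# Measures of central tendency
x_mean <- mean(x)      # sum of the values divided by their count
x_median <- median(x)  # middle value of the sorted data

# Measures of dispersion
x_range <- range(x)    # smallest and largest value
x_var <- var(x)        # sample variance (divides by n - 1)
x_sd <- sd(x)          # sample standard deviation (square root of the variance)

print(c(mean = x_mean, median = x_median, variance = x_var, sd = x_sd))
print(x_range)
```

Note that var() and sd() in R compute the sample versions of these measures, which divide by n - 1 rather than n.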

[Figure: Samples from two populations with different standard deviations]

Measures of central tendency

Measures of central tendency describe the value around which the data are centered. They give a quick sense of what a typical observation looks like. Even though these terms are widely known, it is worth reviewing them briefly.

Mean

The mean, also called the average, is calculated by dividing the sum of the values by the number of values. The population mean is denoted by μ. We can find the mean of data in R using the mean() function, called as mean(<data>).
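To make the definition concrete, here is a minimal sketch that computes the mean by hand and compares it with mean(). The `scores` vector is a hypothetical example, not data from this lesson.

```r
# Hypothetical scores, used only for illustration
scores <- c(10, 20, 30, 40)

# Sum of the values divided by the number of values
manual_mean <- sum(scores) / length(scores)

# The built-in function gives the same result
builtin_mean <- mean(scores)

print(manual_mean)
print(builtin_mean)

# mean() returns NA when a value is missing, unless na.rm = TRUE is set
scores_with_na <- c(scores, NA)
print(mean(scores_with_na))
print(mean(scores_with_na, na.rm = TRUE))
```

The na.rm = TRUE argument is worth remembering when working with real datasets, which often contain missing values.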

[Figure: A mean joke]

Median

The median is the value located in the middle when the elements of a sample are sorted. When the number of elements is even, there is no single middle value. In this case, the median is the average of the two middle values. We can use the median() function to find the median value of data in R.

median(<data>) # Syntax structure
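The odd and even cases described above can be checked with a short sketch. Both vectors are hypothetical and deliberately unsorted, since median() sorts the values internally.

```r
# Hypothetical samples, used only for illustration
odd_sample <- c(7, 1, 5)      # sorted: 1 5 7
even_sample <- c(7, 1, 5, 3)  # sorted: 1 3 5 7

# Odd number of elements: the single middle value
odd_median <- median(odd_sample)

# Even number of elements: the average of the two middle values, (3 + 5) / 2
even_median <- median(even_sample)

print(odd_median)
print(even_median)
```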

Mode

The mode is the most frequent value in a sample. R does not have a built-in function for this calculation (the existing mode() function returns the storage type of an object, not the statistical mode). However, we can find the most frequent element by combining a few base functions. The following expression returns the mode value in R.

# Syntax structure
names(which.max(table(<sample>)))

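The expression first counts how many times each distinct value occurs in the data by using table(). Then, which.max() picks the value with the highest count, and names() returns its label as a character string. If several values are tied for the highest count, only the first one is returned.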
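Because which.max() reports only the first maximum, a sample with tied values needs a slightly different approach. The sketch below wraps the idea in a helper that returns every most-frequent value. The name `find_modes` and the `sample_values` vector are hypothetical and not part of base R or this lesson.

```r
# Hypothetical helper (not part of base R): returns all most-frequent values
find_modes <- function(x) {
  counts <- table(x)                    # frequency of each distinct value
  names(counts)[counts == max(counts)]  # keep every value with the top count
}

# Hypothetical sample with a single mode
sample_values <- c(2, 2, 2, 5, 18, 18, 40)

first_mode <- names(which.max(table(sample_values)))
print(first_mode)                 # the one-line expression from the lesson
print(find_modes(sample_values))  # same answer when there is no tie

# In iris, every species occurs equally often, so all three are modes
print(find_modes(iris$Species))
```

Both approaches return character strings because they read the labels of the frequency table. Wrap the result in as.numeric() if a numeric mode is needed.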

Here is a simple display of the mean, median, and mode in a dataset.

[Figure: Measures of central tendency]

In the illustration above, we sort the original dataset and mark the central tendency points. The mode is 2 (it occurs 3 times), the median is 18 (the middle value of the sorted data), and the mean is approximately 25.64 (282 divided by 11).

Let’s practice the syntax of central tendency measures for the iris dataset.

# We use the built-in "iris" dataset in this exercise

# Calculate mean values
print('--------- Mean of the `Sepal.Length` column -------------------')
mean(iris$Sepal.Length) # Calculate the mean of the "Sepal.Length" column
print('---------- Mean of the `Sepal.Width` column -----------------')
mean(iris$Sepal.Width) # Calculate the mean of the "Sepal.Width" column

# Calculate median values
print('-------- Median of the `Petal.Length` column --------------------------')
median(iris$Petal.Length) # Find the median of the "Petal.Length" column
print('----------- Median of the `Petal.Width` column ----------------------')
median(iris$Petal.Width) # Find the median of the "Petal.Width" column

# Find the mode value
print('----------- The number of occurrences of each category in the `Species` column ------------')
table(iris$Species) # Count how many times each unique value occurs
print('The name and the position of the first category with the highest count')
which.max(table(iris$Species)) # Returns the position (not the count) of the first maximum, labeled with its name
print('---------- The category with the highest occurrence -------------------')
# All three species occur 50 times, so the first one ("setosa") is returned
names(which.max(table(iris$Species))) # Isolate the name of the mode value in the "Species" column

Percentiles and quantiles

A quantile is a general term used to describe any set of values that divide a dataset into equal-sized groups. For example, quartiles are a type of quantile that divide a dataset into four equal parts. ...