Descriptive Statistics
Build an understanding of fundamental descriptive statistical terms.
Statistics is used in business, medical research, social sciences, manufacturing, and many more areas. We use statistics to analyze data and make informed decisions. It helps us take a general look at any given data and identify patterns, trends, and relationships, which we can then use to make predictions and identify opportunities for improvement.
Descriptive statistics is the branch of statistics that deals with organizing, summarizing, and presenting data. We use descriptive methods to describe the main features of a dataset, using measures of central tendency (mean, median, mode) and measures of dispersion (range, variance, standard deviation). The aim is to provide a concise and meaningful overview of the data.
Measures of central tendency
Measures of central tendency describe the typical or central value around which the data cluster. They give a quick impression of the general characteristics of the data. Even though these terms are widely known, it is worth reviewing them briefly.
Mean
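The mean is also called the average of the data. It is calculated by dividing the sum of the values by the number of values. The sample mean is commonly denoted by $\bar{x}$. We can find the mean of data using the mean() function in R.
The generic formula is as follows:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Here, $x_1, \dots, x_n$ are the values and $n$ is the number of values.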
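As a minimal sketch of the definition above, the following compares the mean computed by hand with the result of mean(). The sample values are hypothetical and chosen only for illustration. The last two lines show the na.rm argument of mean(), which matters when the data contain missing values.

```r
# Hypothetical sample values, chosen only for illustration
values <- c(4, 8, 15, 16, 23, 42)

# Mean by definition: sum of the values divided by the number of values
manual_mean <- sum(values) / length(values)

# Mean using R's built-in function
builtin_mean <- mean(values)

print(manual_mean)   # 18
print(builtin_mean)  # 18

# A missing value makes the result NA unless na.rm = TRUE is set
with_missing <- c(values, NA)
print(mean(with_missing))                # NA
print(mean(with_missing, na.rm = TRUE))  # 18
```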
Median
The median refers to the value located in the middle when the elements of a sample are sorted.
However, a sample with an even number of elements has no single middle value. In this case, the median is calculated by taking the average of the two middle values.
We can use the median() function to find the median value of data in R.
median(<data>) # Syntax structure
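The odd and even cases described above can be sketched as follows. Both samples are hypothetical and exist only to show how median() behaves; the function sorts the data internally, so the input does not need to be sorted first.

```r
# Hypothetical samples, chosen only for illustration
odd_sample  <- c(7, 1, 5, 3, 9)       # 5 values: one middle value
even_sample <- c(7, 1, 5, 3, 9, 11)   # 6 values: two middle values

sort(odd_sample)     # 1 3 5 7 9
median(odd_sample)   # 5

sort(even_sample)    # 1 3 5 7 9 11
median(even_sample)  # 6, the average of 5 and 7
```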
Mode
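The mode is the most frequent value in a sample. R does not have a built-in function for this calculation; the base mode() function reports the storage mode of an object (such as "numeric" or "character"), not the statistical mode. However, we can find the most frequent element by combining a few base functions. The following expression returns the mode value in R.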
# Syntax structure
names(which.max(table(<sample>)))
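The expression works from the inside out. First, table() counts how many times each distinct value occurs. Then, which.max() picks the position of the highest count; if several values are tied, it picks the first one. Finally, names() returns the label of that value as a character string.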
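One way to reuse the expression above is to wrap it in a small function. This is only a sketch: get_mode() is a hypothetical helper name, not part of base R, and the samples are made up for illustration. It also shows two consequences of the approach: the result is a character string, and a tie returns only the first of the tied values.

```r
# get_mode() is a hypothetical helper name, not part of base R
get_mode <- function(x) {
  counts <- table(x)          # frequency of each distinct value
  names(which.max(counts))    # label of the first value with the highest count
}

# Hypothetical sample, chosen only for illustration
fruit <- c("apple", "pear", "apple", "kiwi", "pear", "apple")

table(fruit)      # apple: 3, kiwi: 1, pear: 2
get_mode(fruit)   # "apple"

# The result is a character string, even for numeric input
get_mode(c(2, 2, 5, 7))              # "2"
as.numeric(get_mode(c(2, 2, 5, 7)))  # 2

# With a tie, only the first of the tied values (in sorted order) is returned
get_mode(c("b", "a", "b", "a"))      # "a"

# Base R's mode() is unrelated: it reports the storage mode of an object
mode(fruit)       # "character"
```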
Here is a simple display of the mean, median, and mode in a dataset.
In the illustration above, we sort the original dataset and mark the central tendency points. The most frequent value is 2 (with 3 repetitions), the value in the middle of the sorted data is 18, and the average of the data points is approximately 25.64 (282 divided by 11).
Let’s practice the syntax of central tendency measures for the iris dataset.
# We use the "iris" dataset in this exercise

# Calculate mean values
print('--------- Mean of the `Sepal.Length` column -------------------')
mean(iris$Sepal.Length) # Calculate the mean of the "Sepal.Length" column
print('---------- Mean of the `Sepal.Width` column -----------------')
mean(iris$Sepal.Width) # Calculate the mean of the "Sepal.Width" column

# Calculate median values
print('-------- Median of the `Petal.Length` column --------------------------')
median(iris$Petal.Length) # Find the median of the "Petal.Length" column
print('----------- Median of the `Petal.Width` column ----------------------')
median(iris$Petal.Width) # Find the median of the "Petal.Width" column

# Find the mode values
print('----------- The number of occurrences of each category in the `Species` column ------------')
table(iris$Species) # Count how often each unique value occurs
print('The category name and the position of the most repeated category')
which.max(table(iris$Species)) # Find the position of the highest frequency (the first one if there is a tie)
print('---------- The category with the highest occurrence -------------------')
names(which.max(table(iris$Species))) # Isolate the name of the mode value in "Species" column
Percentiles and quantiles
A quantile is a general term used to describe any set of values that divide a dataset into equal-sized groups. For example, quartiles are a type of quantile that divide a dataset into four equal parts. ...
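As a brief sketch of the quartile example, base R's quantile() function returns the cut points for the requested probabilities. The choice of the Sepal.Length column is an assumption made only for illustration, and the call relies on the function's default interpolation method.

```r
# Quartiles of a built-in column: the 25th, 50th, and 75th percentiles
quartiles <- quantile(iris$Sepal.Length, probs = c(0.25, 0.50, 0.75))
print(quartiles)

# The second quartile (the 50% cut point) coincides with the median
second_quartile <- unname(quartiles["50%"])
isTRUE(all.equal(second_quartile, median(iris$Sepal.Length)))
```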