Descriptive Statistics

Build an understanding of fundamental descriptive statistical terms.

We'll cover the following...

Measures of central tendency
Measures of variability
Variables and relationship analysis
- Dependent and independent variables

Statistics is used in business, medical research, social sciences, manufacturing, and many other areas. We use statistics to analyze data and make informed decisions. It gives us an overview of a dataset and helps us identify patterns, trends, and relationships, which we can then use to make predictions and find opportunities for improvement.

Descriptive statistics is the branch of statistics that describes and summarizes the main features of a dataset. It relies on measures of central tendency (mean, median, mode) and measures of dispersion (range, variance, standard deviation). The aim is to provide a concise and meaningful overview of the data.
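
As a quick illustration of such a summary, the sketch below computes several of the measures named above for one column of R's built-in iris dataset. The variable names x, center, and spread are chosen only for this example.

```r
# Summarize one numeric column of the built-in iris dataset
x <- iris$Sepal.Length

# Central tendency: where the values are centered
center <- c(mean = mean(x), median = median(x))

# Dispersion: how widely the values are spread
spread <- c(min = min(x), max = max(x), variance = var(x), sd = sd(x))

print(center)
print(spread)
summary(x)  # minimum, quartiles, median, mean, and maximum in one call
```

The standard deviation is the square root of the variance, so the last two entries of spread always carry the same information on different scales.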

Measures of central tendency

Measures of central tendency describe the center of a dataset with a single typical value. They give a quick sense of the general characteristics of the data. Even though these terms are widely known, it is worth reviewing them briefly.

Mean

The mean is the average value of the data. It is calculated by dividing the sum of the values by the number of values. It is usually denoted by $\mu$ for a population and by $\bar{x}$ for a sample. We can find the mean of data using the mean() function in R. The generic formula is as follows:

$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$

Here, $x_i$ are the individual values and $N$ is the number of values.
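
To connect the definition (the sum of the values divided by the number of values) with the mean() function, here is a minimal sketch. The sample values are hypothetical and chosen only for illustration.

```r
# Hypothetical sample values, used only for illustration
values <- c(4, 8, 15, 16, 23, 42)

manual_mean  <- sum(values) / length(values)  # sum divided by count
builtin_mean <- mean(values)                  # same calculation with mean()

print(manual_mean)   # 18
print(builtin_mean)  # 18

# mean() returns NA when a value is missing, unless na.rm = TRUE is set
with_missing <- c(values, NA)
mean(with_missing)                # NA
mean(with_missing, na.rm = TRUE)  # 18
```

The na.rm argument is worth remembering for real datasets, where missing values are common.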

Median

The median is the middle value when the elements of a sample are sorted. When the number of elements is even, there is no single middle value, so the median is the average of the two middle values. We can use the median() function to find the median of data in R.

median(<data>) # Syntax structure
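
The odd and even cases described above can be checked with a small sketch. The sample vectors are hypothetical, and the manual calculation is included only to show what median() does for an even number of elements.

```r
# Hypothetical samples, used only for illustration
odd_sample  <- c(7, 1, 5)     # sorted: 1 5 7   -> the middle value is 5
even_sample <- c(7, 1, 5, 3)  # sorted: 1 3 5 7 -> the middle values are 3 and 5

median(odd_sample)   # 5
median(even_sample)  # 4

# Manual check for the even case: average the two middle values
sorted <- sort(even_sample)
n <- length(sorted)
manual_median <- (sorted[n / 2] + sorted[n / 2 + 1]) / 2
print(manual_median)  # 4
```

Note that median() sorts the values internally, so the input does not need to be sorted first.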

Mode

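The mode is the most frequent value in a sample. R does not have a built-in function for this calculation: the built-in mode() function returns the storage mode of an object (for example, "numeric"), not the statistical mode. However, we can find the most frequent element by combining a few base functions. The following expression returns the mode value in R.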

# Syntax structure
names(which.max(table(<sample>)))

The table() function counts how often each distinct value occurs in the data. Then, which.max() selects the first value with the highest count, and names() returns the label of that value as a character string.
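
Because which.max() keeps only the first maximum, the expression above reports a single value even when several values are tied. The sketch below wraps the same idea in a helper that returns every tied value. The name stat_mode is hypothetical and is not part of base R.

```r
# Hypothetical helper; stat_mode is not a base R function
stat_mode <- function(x) {
  counts <- table(x)                    # frequency of each distinct value
  names(counts)[counts == max(counts)]  # every value tied for the top count
}

stat_mode(c("a", "b", "b", "c"))  # "b"
stat_mode(c(1, 2, 2, 3, 3))       # "2" "3" (a tie, returned as character strings)

# The one-line expression keeps only the first of the tied values
names(which.max(table(c(1, 2, 2, 3, 3))))  # "2"
```

In the iris dataset, each of the three species occurs 50 times, so stat_mode(iris$Species) returns all three names, while the one-line expression returns only "setosa".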

The following example calculates the mean, median, and mode for columns of the built-in iris dataset.

# We use the built-in "iris" dataset in this exercise

# Calculate mean values
print('--------- Mean of the `Sepal.Length` column ---------')
mean(iris$Sepal.Length)  # Calculate the mean of the "Sepal.Length" column
print('--------- Mean of the `Sepal.Width` column ---------')
mean(iris$Sepal.Width)  # Calculate the mean of the "Sepal.Width" column

# Calculate median values
print('--------- Median of the `Petal.Length` column ---------')
median(iris$Petal.Length)  # Find the median of the "Petal.Length" column
print('--------- Median of the `Petal.Width` column ---------')
median(iris$Petal.Width)  # Find the median of the "Petal.Width" column

# Find the mode value
print('--------- Frequency of each category in the `Species` column ---------')
table(iris$Species)  # Count how many times each category occurs
print('--------- Name and position of the first most frequent category ---------')
which.max(table(iris$Species))  # Each species occurs 50 times, so the first one is returned
print('--------- The category with the highest occurrence ---------')
names(which.max(table(iris$Species)))  # Isolate the name of the mode value in the "Species" column
