EDA for a Numerical Explanatory Variable
Learn about analyzing numerical data to make observations that will help in regression.
Typing out all these summary statistic functions in summarize() would be long and tedious. Instead, let’s use the convenient skim() function from the skimr package. This function takes in a data frame, skims it, and returns the commonly used summary statistics. Let’s take our evals_ch5 data frame, select() only the outcome and explanatory variables, teaching score and bty_avg, and pipe them into the skim() function:
evals_ch5 %>% select(score, bty_avg) %>% skim()
For the numerical variables teaching score and bty_avg, it returns:
n_missing: This is the number of missing values.
complete_rate: This is the proportion of non-missing (complete) values.
mean: This is the average.
sd: This is the standard deviation.
p0: The 0th percentile is the value at which 0% of the observations are smaller than it (the minimum value).
p25: The 25th percentile is the value at which 25% of the observations are smaller than it (the 1st quartile).
p50: The 50th percentile is the value at which 50% of the observations are smaller than it (the 2nd quartile, more commonly called the median).
p75: The 75th percentile is the value at which 75% of the observations are smaller than it (the 3rd quartile).
p100: The 100th percentile is the value at which 100% of the observations are smaller than or equal to it (the maximum value).
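To make these definitions concrete, here is a minimal sketch that computes the same nine quantities by hand in base R. The vector x is a hypothetical toy example, not the evals_ch5 data, and the percentiles use the default method of quantile(), which may differ in detail from what skim() does internally.

```r
# Hypothetical toy vector (NOT the evals_ch5 data) with one missing value
x <- c(2.5, 3.0, 4.0, 4.5, 5.0, NA)

# Percentiles of the non-missing values, using quantile()'s default method
q <- quantile(x, probs = c(0, 0.25, 0.50, 0.75, 1), na.rm = TRUE, names = FALSE)

summary_stats <- data.frame(
  n_missing     = sum(is.na(x)),          # count of missing values
  complete_rate = mean(!is.na(x)),        # proportion of non-missing values
  mean          = mean(x, na.rm = TRUE),  # average of non-missing values
  sd            = sd(x, na.rm = TRUE),    # standard deviation
  p0            = q[1],                   # minimum
  p25           = q[2],                   # 1st quartile
  p50           = q[3],                   # median
  p75           = q[4],                   # 3rd quartile
  p100          = q[5]                    # maximum
)

print(summary_stats)
```

Note that complete_rate is a proportion between 0 and 1: here five of the six values are present, so it is 5/6, while n_missing is the count 1.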
Looking at the skim() output, we can see how the values of both variables are distributed. For example, the mean teaching score was 4.17 out of 5, whereas the ...