Frequentist vs. Bayesian Statistics
In this lesson, we distinguish between frequentist and Bayesian statistics using several coding examples in Python.
Types of statistics
Frequentist and Bayesian statistics are two approaches to statistical inference, both used to draw conclusions about a population based on sample data.
Frequentist statistics is based on repeated sampling, where the probability of an event is determined by the relative frequency of that event occurring in a large number of independent samples. In this approach, probability is interpreted as the long-run relative frequency of an event and is not necessarily associated with any individual event. In frequentist statistics, statistical inference is based on hypothesis testing, where a null hypothesis is assumed to be true until sufficient evidence is found to reject it in favor of an alternative hypothesis.
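The hypothesis-testing workflow can be sketched with a short, hypothetical example: we test whether a population's mean height equals 170 cm based on a simulated sample. The sample data, the null value of 170 cm, and the 0.05 significance level are all illustrative assumptions, not part of this lesson's running example.

```python
import numpy as np
from scipy import stats

# Hypothetical sample of 50 height measurements in cm (illustrative only)
rng = np.random.default_rng(0)
sample = rng.normal(loc=172, scale=8, size=50)

# Null hypothesis: the population mean height is 170 cm
t_stat, p_value = stats.ttest_1samp(sample, popmean=170)

# Reject the null hypothesis only if the p-value falls
# below the chosen significance level
alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.3f}: reject the null hypothesis.")
else:
    print(f"p = {p_value:.3f}: fail to reject the null hypothesis.")
```

Note that the conclusion is phrased as "fail to reject" rather than "accept": in the frequentist framework, a large p-value only means the data provide insufficient evidence against the null hypothesis.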
Bayesian statistics, on the other hand, is based on subjective probability, where the probability of an event is determined by an individual's belief or degree of confidence in that event occurring. In this approach, probability is considered a measure of an individual's uncertainty about an event and can be updated as new information becomes available. In Bayesian statistics, statistical inference is based on updating our prior beliefs about an event in light of new evidence, using Bayes' theorem.
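This updating step can be illustrated with a minimal, self-contained sketch of Bayes' theorem. The coin scenario and all the numbers below (the prior of 0.8 and the two likelihoods) are hypothetical values chosen purely for illustration.

```python
# Two hypotheses about a coin, with subjective prior beliefs
prior = {"fair": 0.8, "biased": 0.2}

# Likelihood of observing heads under each hypothesis
likelihood_heads = {"fair": 0.5, "biased": 0.7}

# Bayes' theorem: P(H | data) = P(data | H) * P(H) / P(data)
evidence = sum(likelihood_heads[h] * prior[h] for h in prior)
posterior = {h: likelihood_heads[h] * prior[h] / evidence for h in prior}

# After observing heads, belief shifts slightly toward the biased coin
print(posterior)
```

The posterior then serves as the prior for the next observation, so beliefs are refined incrementally as evidence accumulates.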
Both frequentist and Bayesian statistics have their strengths and weaknesses, and which approach is used depends on the problem being addressed and the goals of the analysis. We'll see them in detail in subsequent sections.
Frequentist statistics
Frequentist statistics is a branch of statistics that uses probability theory and assumes that a fixed set of underlying probability distributions can explain all observations. It provides inferences and predictions about a population based on data from a sample. Frequentist statistics aims to infer the population parameters from the sample data, using tools such as maximum likelihood estimation (MLE) and confidence intervals.
Example
Let’s consider an example. The municipal corporation has assigned us the task of using statistics to suggest a standard door size for all public buildings. To do so, we first have to find the average height of the population in the municipality. This is shown in the illustration below.
The ideal way is to collect the heights of all the people from the municipality and calculate their mean (average) height. However, this is not feasible. So let’s assemble a sample of 1000 people to find the population’s mean height. This is done in Python as follows:
```python
# Importing the required modules
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

# Sampling the heights of 1000 people
np.random.seed(42)
pop_1000 = np.round(np.random.uniform(low=160, high=190, size=(1000,)), 2)

# Printing the result
print("A sample from the heights of a thousand people is as follows.")
print(pop_1000[:5])

# Plotting the distribution of the sample
figure(figsize=(8, 6), dpi=80)
plt.hist(pop_1000)
plt.xlabel("Heights in cm")
plt.ylabel("Frequency of Heights")
plt.title("Distribution of Heights for a sample of 1000 people")
plt.savefig('output/graph.png')
```

In the code above:

- Lines 2–4: We import the required libraries.
- Lines 7–8: We generate the sample of 1000 heights using a uniform distribution.
- Lines 11–12: We print the first few heights from the sample.
- Lines 15–20: We plot the distribution of the sample space.
From the data of the sample space generated above, we use maximum likelihood estimation (MLE) to calculate the mean height for the population. For simplicity, this is the same as the sample mean. Because the mean height calculated using MLE comes from a sample, we then use a confidence interval to provide a range of values in which the population mean is likely to lie.
```python
# Importing the required modules
import numpy as np
import scipy.stats as stats

# Sampling the heights of 1000 people
np.random.seed(42)
pop_1000 = np.round(np.random.uniform(low=160, high=190, size=(1000,)), 2)
sample_mean = pop_1000.mean()

# Calculating the confidence interval
alpha = 0.05
confidence_level = 1 - alpha
z_score = stats.norm.ppf(1 - (alpha / 2))
sample_std = stats.tstd(pop_1000)
margin_of_error = z_score * (sample_std / (len(pop_1000) ** 0.5))
confidence_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)

print(f"The average height of the population calculated using the sample is {sample_mean:.2f}.\n"
      f"However, we are not 100% sure about this. We are {confidence_level * 100:.0f}% sure that "
      f"the value of the population mean is between {confidence_interval[0]:.2f} and {confidence_interval[1]:.2f}.")
```

In the code above:

- Lines 2–3: We import the required libraries.
- Lines 6–8: We create the sample space and compute the sample mean, which is the MLE of the population mean.
- Lines 11–13: We set the significance level and calculate the z-score.
- Lines 14–16: We estimate the margin of error, which is used to compute the confidence interval.
- Lines 18–20: We print the estimated mean along with the confidence interval.
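As a supplementary sketch (not part of the lesson's original code), we can also see the frequentist logic at work by watching the margin of error shrink as the sample size grows; the sample sizes below are arbitrary choices for illustration.

```python
import numpy as np
import scipy.stats as stats

np.random.seed(42)
alpha = 0.05
z_score = stats.norm.ppf(1 - alpha / 2)

# The margin of error shrinks roughly as 1 / sqrt(n)
margins = []
for n in [100, 1000, 10000]:
    sample = np.random.uniform(low=160, high=190, size=n)
    margin = z_score * stats.tstd(sample) / (n ** 0.5)
    margins.append(margin)
    print(f"n = {n:>6}: margin of error = {margin:.3f} cm")
```

This is why larger samples yield narrower confidence intervals: the estimate of the population mean becomes more precise as more independent observations are collected.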