Data scientists and researchers need to collect data for running tests, analyzing scenarios, and testing hypotheses. Ideally, they would obtain data from the entire population of the subject in question. However, this is rarely feasible: limited resources mean data scientists must rely on samples drawn from that population.
Data samples are derived from the population that is being studied. The aim is to obtain samples that can represent the population so that the findings applicable to the sample can be generalized to the population.
The illustration below shows the difference between population and sample:
There are several ways data can be sampled from a target population. Sampling techniques can be divided into two broad categories:
Probability sampling: Elements are selected at random, so every element of the population has a known, non-zero chance of being part of the sample (an equal chance in the simplest designs). Probability samples tend to be more representative of the population.
Non-probability sampling: Elements are not selected at random, so not every element has a chance of being selected, and the selection probabilities are generally unknown. This method of sampling might not always represent the population as a whole.
We will now discuss techniques that fall under the category of probability sampling:
Simple Random Sampling (SRS) is one of the simplest methods of sampling: subjects are selected purely at random, and each element has an equal chance of getting selected. Sampling is usually done by assigning a number to each element of the population and carrying out a lucky draw.
In the illustration on the right, each individual has an equal chance of getting selected.
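The lucky draw described above can be sketched with Python's standard `random` module. The function name `simple_random_sample`, the ten numbered individuals, and the seed are illustrative assumptions, not part of the original text.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k distinct elements; every element is equally likely to be picked."""
    rng = random.Random(seed)  # a seed only makes the draw reproducible
    return rng.sample(population, k)

# Hypothetical population: ten individuals numbered 1 to 10.
population = list(range(1, 11))
sample = simple_random_sample(population, k=3, seed=42)
print(sample)
```

Sampling is done without replacement here, so no individual can be drawn twice.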
In stratified sampling, elements are first grouped by common characteristics such as gender, age, income level, or profession. These subgroups are known as strata (singular: stratum). Elements are then sampled from each stratum. This method ensures that the sampled data has representation from all subgroups.
The illustration on the right creates strata based on profession and then samples from them.
Elements within a stratum are homogeneous.
The strata do not need to contain an equal number of elements.
Each element within a stratum has an equal probability of being selected.
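A minimal sketch of the procedure above, using only the standard library. It assumes proportional allocation (the same fraction is drawn from every stratum, with at least one element each), which is one common choice that the text does not prescribe. The profession data and the name `stratified_sample` are made up for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=None):
    """Group records into strata by key(record), then draw a simple
    random sample of the given fraction from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)

    sample = []
    for members in strata.values():
        # Assumption: take at least one element so every stratum is represented.
        n = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, n))
    return sample

# Hypothetical population with strata of unequal size, keyed by profession.
people = (
    [("doctor", i) for i in range(6)]
    + [("engineer", i) for i in range(10)]
    + [("teacher", i) for i in range(4)]
)
sample = stratified_sample(people, key=lambda p: p[0], fraction=0.5, seed=0)
print(sample)
```

Within each stratum the draw is a simple random sample, so every element of a stratum has the same chance of selection.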
In cluster sampling, we divide our target population into subgroups known as clusters and then choose a cluster at random. Each cluster has an equal chance of getting selected.
The illustration on the right shows each cluster having an equal chance of being selected.
Elements within clusters are heterogeneous.
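As a rough sketch, cluster sampling can be written as a random draw over cluster names rather than over individual elements. It assumes a one-stage design in which every element of a chosen cluster enters the sample. The school clusters and the name `cluster_sample` are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters=1, seed=None):
    """Pick whole clusters at random; each cluster is equally likely."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    # Assumption: all elements of a chosen cluster are included.
    return {name: clusters[name] for name in chosen}

# Hypothetical clusters, e.g. students grouped by school.
clusters = {
    "school_a": ["a1", "a2", "a3"],
    "school_b": ["b1", "b2"],
    "school_c": ["c1", "c2", "c3", "c4"],
}
picked = cluster_sample(clusters, n_clusters=1, seed=1)
print(picked)
```

Because the random draw is over clusters, elements in larger and smaller clusters are selected together with their whole group.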
We will now discuss techniques that fall under the category of non-probability sampling:
In convenience sampling, elements are selected based on availability and convenience, for example on a first-come, first-served basis or according to who is willing to take part in a survey.
The illustration on the right chooses the first three individuals from each line.
Convenience samples are often not representative of the population, since whoever is easiest to reach may over- or under-represent groups defined by gender, race, age, religion, etc.
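The "first three from each line" selection can be sketched as a plain slice with no randomness at all, which is exactly why it is prone to bias. The queue contents and the name `convenience_sample` are illustrative assumptions.

```python
def convenience_sample(lines, per_line=3):
    """Take the first `per_line` people from each line (no random draw)."""
    return [person for line in lines for person in line[:per_line]]

# Hypothetical queues of people, in arrival order.
lines = [
    ["l1_p1", "l1_p2", "l1_p3", "l1_p4", "l1_p5"],
    ["l2_p1", "l2_p2", "l2_p3", "l2_p4"],
]
sample = convenience_sample(lines)
print(sample)
# ['l1_p1', 'l1_p2', 'l1_p3', 'l2_p1', 'l2_p2', 'l2_p3']
```

Anyone who arrives late has no chance of being selected, so the sample reflects only the earliest arrivals.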
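Quota sampling involves selecting elements based on some predetermined rule rather than at random. In the usual sense of the term, the rule is a fixed quota of elements per subgroup, filled non-randomly. The examples used here are simpler rules, such as selecting multiples of a number or taking every fifth person to sign up; strictly speaking, picking every k-th element is more commonly called systematic sampling.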
The illustration on the right shows balls that are multiples of two being selected.
Quota samples are also not guaranteed to be representative of the population.
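The predetermined rules mentioned above (multiples of a number, every fifth person to sign up) can be sketched as deterministic filters. The helper names `select_multiples` and `every_kth` and the example data are assumptions made for illustration.

```python
def select_multiples(items, n):
    """Keep only the items whose number is a multiple of n."""
    return [x for x in items if x % n == 0]

def every_kth(items, k):
    """Keep every k-th item in arrival order (the k-th, 2k-th, ...)."""
    return items[k - 1::k]

# Hypothetical balls numbered 1 to 10, as in the illustration.
balls = list(range(1, 11))
print(select_multiples(balls, 2))  # [2, 4, 6, 8, 10]

# Hypothetical sign-up list of twelve people.
signups = [f"person_{i}" for i in range(1, 13)]
print(every_kth(signups, 5))  # ['person_5', 'person_10']
```

Both rules are fully determined in advance, so elements that do not match the rule can never appear in the sample.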