Data Distribution

Learn about different types of data distributions and their use cases.

Understanding data distribution

A data distribution describes how the values in a dataset are spread between the minimum and maximum values. In other words, it shows how often each value (or range of values) occurs in the data. The distribution tells us a lot about the data. For example, by looking at its shape, we can tell whether the data is symmetrical or whether it piles up in a specific range. There are various types of data distribution, each with different characteristics.

Distribution of data
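
As a small illustration of "how often each value occurs," the following R sketch counts occurrences in a short vector and reports its minimum and maximum. The values in scores are invented for this example and do not come from any real dataset.

```r
# Hypothetical sample data, invented for illustration
scores <- c(2, 3, 3, 4, 4, 4, 5, 5, 6)

# Frequency of each distinct value: the empirical distribution of the data
freq <- table(scores)
print(freq)

# The smallest and largest values bound the distribution
value_range <- range(scores)
print(value_range)
```

Calling hist(scores) would draw the same counts as a histogram, which is usually the quickest way to see the shape of a distribution.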

We need to know these two terms to understand the distribution types:

  • Continuous data refers to data that can take any value within a specified range. For example, the weight of an object can be considered continuous data because it can take an infinite number of values within a particular range. In R, the double data type represents continuous data.

  • Discrete data, on the other hand, refers to data that can only take specific, separate values. These values are usually whole numbers (integers) or other countable values, with no intermediate values between them. For example, the number of siblings a person has and the number of books in a library are discrete data.
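
The difference between the two kinds of data shows up in how R stores values. The sketch below uses made-up variables (weight_kg and siblings are hypothetical names) to check the storage type with typeof().

```r
# Continuous measurement: stored as a double by default
weight_kg <- 72.45
print(typeof(weight_kg))   # "double"

# Discrete count: the L suffix creates an integer
siblings <- 3L
print(typeof(siblings))    # "integer"

# A whole number typed without the L suffix is still stored as a double
print(typeof(3))           # "double"
```

Note that the storage type is only a hint: a count can be stored as a double and still be discrete data, so the distinction depends on what the values mean rather than on the type alone.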

Normal distribution

The normal distribution, also known as the Gaussian distribution, is a symmetric continuous distribution with a bell-shaped curve. Most of the data falls around the mean, with fewer data points toward the extremes. In a normal distribution, the mean, median, and mode all lie at the same point, the center of the curve.

Normal distribution
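
The claim that the mean, median, and mode coincide can be checked numerically. The following is a minimal sketch, assuming a normal distribution with mean 100 and standard deviation 15; both values are chosen only for this example. It uses qnorm() for the median, optimize() to locate the peak of the density (the mode), and dnorm() to confirm symmetry.

```r
# Assumed parameters, chosen only for illustration
mu <- 100
sigma <- 15

# Median: the 50% quantile of the distribution
med <- qnorm(0.5, mean = mu, sd = sigma)

# Mode: the point where the density curve peaks, found numerically
peak <- optimize(function(x) dnorm(x, mean = mu, sd = sigma),
                 interval = c(mu - 4 * sigma, mu + 4 * sigma),
                 maximum = TRUE)$maximum

# Symmetry: points equally far from the mean have the same density
left  <- dnorm(mu - 10, mean = mu, sd = sigma)
right <- dnorm(mu + 10, mean = mu, sd = sigma)

print(c(mean = mu, median = med, mode = round(peak, 2)))
print(isTRUE(all.equal(left, right)))
```

Because optimize() is a numerical search, the mode it reports matches the mean only up to a small tolerance, which is why the result is rounded before printing.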

The normal distribution is often used to model real-world phenomena, such as the distribution of heights or IQ scores in a population. Many statistical procedures assume that the data being analyzed follows a normal distribution, so understanding and identifying normal distributions is an essential part of statistical analysis.

The empirical rule states that for a normal distribution:

  • About 68% of the data falls within one standard deviation of the mean.
  • About 95% of the data falls within two standard deviations of the mean.
  • About 99.7% of the data falls within three standard deviations of the mean.

Empirical rule
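
The percentages in the empirical rule can be reproduced with pnorm(), R's cumulative distribution function for the normal distribution. This sketch works on the standard normal distribution (mean 0, standard deviation 1); within_k_sd is a helper name introduced here for illustration.

```r
# Probability that a normal value lies within k standard deviations of the mean
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

coverage <- sapply(1:3, within_k_sd)
names(coverage) <- c("1 sd", "2 sd", "3 sd")

# Percentages, rounded to one decimal place: 68.3, 95.4, 99.7
print(round(coverage * 100, 1))
```

The exact figures are slightly different from the rounded 68%, 95%, and 99.7% quoted in the rule, which is why the rule is best read as an approximation.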

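In a normal distribution, values near the mean are more likely than values far from it: the probability density is highest at the mean and falls off symmetrically on both sides. For example, if ages in a group were normally distributed around a mean of 30, there would be many more 30-year-olds than 90-year-olds.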
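
To make this concrete, the sketch below compares the normal density at several ages using dnorm(). It assumes, purely for illustration, that ages follow a normal distribution with mean 30 and standard deviation 15; real age data generally does not follow this shape, and both parameters are invented.

```r
# Assumed parameters, invented for illustration only
mean_age <- 30
sd_age <- 15

ages <- c(30, 45, 60, 90)

# dnorm() gives the height of the normal density curve at each value
density_at_age <- dnorm(ages, mean = mean_age, sd = sd_age)
names(density_at_age) <- ages
print(signif(density_at_age, 3))

# How much denser the distribution is at the mean than at 90
ratio <- density_at_age[["30"]] / density_at_age[["90"]]
print(ratio)
```

The density shrinks steadily as the age moves away from the mean, so a value drawn at random is far more likely to land near 30 than near 90 under these assumptions.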

R provides functions to create random datasets with different distribution types. This is very useful when testing various strategies in our analysis. For example, the rnorm() function can be used to create a dataset with a ...