What is a z-score and its significance in a dataset?

The z-score is a concept widely used in probability and statistics. It is typically used when the data is normally distributed. To understand the z-score better, we first need to know what a normal distribution is.

Normal distribution

The figure below shows a normal distribution curve:

A Normal Distribution Curve

In normally distributed data, values are spread symmetrically above and below the mean. The resulting curve is bell-shaped, and its center marks the mean.

The mean, mode, and median are all equal.

The area under the curve is 1. The curve is symmetrical about the mean.
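These properties can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the mean of 50 and standard deviation of 10 are hypothetical values chosen only for illustration.

```python
from statistics import NormalDist

# Hypothetical parameters, chosen only for illustration.
dist = NormalDist(mu=50, sigma=10)

# Mean, median, and mode coincide for a normal distribution.
print(dist.mean, dist.median, dist.mode)

# Symmetry: the curve has the same height at equal distances
# on either side of the mean.
left_height = dist.pdf(50 - 7)
right_height = dist.pdf(50 + 7)
print(abs(left_height - right_height) < 1e-12)

# The total area under the curve is 1, so half of it lies below the mean.
area_below_mean = dist.cdf(50)
print(area_below_mean)
```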

Background

Oftentimes, we need to compare values from different datasets. Let’s suppose a university accepts ACT and SAT scores for admissions. The two tests are scored on different scales, with different total scores and hence different means. How can the university compare results from each test and decide which student performed better? In such situations, we need to standardize the scores to compare them. The resulting standardized normal variable for each score is called Z.

A random normal variable X is standardized to have a mean of 0 and a standard deviation of 1.

The z-score is used when the data is normally distributed.

The z-score tells us how many standard deviations above or below the mean a value lies.

Mathematical formulation

Let’s familiarize ourselves with some terminology before we craft a formula:

| Symbol | Name | Purpose |
|---|---|---|
| $z$ | Standard normal variable | Standardized score |
| $X$ | Random normal variable | Actual value |
| $\mu$ | mu | Mean of the data |
| $\sigma$ | sigma | Standard deviation of the data |

To standardize a random normal variable, we need to carry out the following steps:

  • Subtract the mean ($\mu$) from the random normal variable ($X$).
  • Divide the result by the standard deviation ($\sigma$).

The final formula is as follows:

$$z = \frac{X - \mu}{\sigma}$$

The illustration below summarizes the procedure:

Standardizing a Random Normal Variable
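As a minimal sketch, the two steps can be written as a small Python function. The name `z_score` and the sample numbers are illustrative assumptions, not part of any library.

```python
def z_score(x, mu, sigma):
    """Return how many standard deviations x lies above (+) or below (-) mu."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    # Step 1: subtract the mean. Step 2: divide by the standard deviation.
    return (x - mu) / sigma


# Hypothetical values: the mean is 70 and the standard deviation is 10.
print(z_score(85, 70, 10))  # 1.5
print(z_score(55, 70, 10))  # -1.5
```

A positive result means the value lies above the mean, and a negative result means it lies below.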

Calculating mean

The mean is the average of all values in the data. It is calculated as follows:

  • Take the sum of all values in the dataset.
  • Divide by the total number of values.

The final formula is as follows:

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$

where $X_i$ is each value of the random normal variable and $N$ is the number of values.
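The same calculation can be sketched in Python. The helper name `mean` and the sample list are hypothetical; `statistics.fmean` from the standard library is used only as a cross-check.

```python
from statistics import fmean


def mean(values):
    """Sum all values, then divide by how many there are."""
    total = sum(values)
    return total / len(values)


sample = [2, 4, 6, 8]  # hypothetical data
mu = mean(sample)
print(mu)  # 5.0
print(mu == fmean(sample))
```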

Calculating standard deviation

Standard deviation indicates how far the values in the data typically lie from the mean. It is calculated as follows:

  • Subtract the mean from each value $X_i$.
  • Square each of the resulting differences.
  • Add all these squares together.
  • Divide by the number of values in the dataset.
  • Take the square root of the result of the previous step.

The final formula is as follows:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$
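The five steps map directly onto a few lines of Python. This is a sketch under the assumption that the data represents the whole population (division by $N$); the function name `population_std` and the sample list are hypothetical.

```python
import math
from statistics import pstdev


def population_std(values):
    n = len(values)
    mu = sum(values) / n
    # Steps 1 and 2: subtract the mean from each value and square the result.
    squared_diffs = [(x - mu) ** 2 for x in values]
    # Steps 3 and 4: add the squares together and divide by the number of values.
    variance = sum(squared_diffs) / n
    # Step 5: take the square root.
    return math.sqrt(variance)


sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data
sigma = population_std(sample)
print(sigma)  # 2.0
print(math.isclose(sigma, pstdev(sample)))
```

Note that `statistics.pstdev` divides by $N$, matching the formula here, while `statistics.stdev` divides by $N - 1$ and would give a slightly larger value.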

Example

We have gathered all the bits of information we need to work with z-score. Let’s work through a simple example:

Suppose 15 students in a class took a test. The professor wants to ensure that he grades them realistically. Therefore, he decides that whoever scores more than 1 standard deviation below the mean will fail while others will pass. The table below shows the summary of scores:

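| Student | Test score (out of 100) |
|---|---|
| Jack | 72 |
| Jim | 86 |
| Gabe | 56 |
| Bill | 92 |
| Alice | 78 |
| Veronica | 94 |
| Angelica | 32 |
| Matt | 44 |
| Thomas | 66 |
| Dice | 100 |
| Donald | 28 |
| Rice | 42 |
| Jones | 88 |
| Chris | 79 |
| Liam | 73 |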

In order to discuss these scores in terms of standard deviation, we need to standardize them. To do so, we will calculate the z-score for each.

Remember! Standardized scores have a mean of 0 and standard deviation of 1.

Finding mean

The total number of values is 15. Therefore, $N = 15$.

Step 1: Taking sum

Sum $= 72 + 86 + 56 + 92 + 78 + 94 + 32 + 44 + 66 + 100 + 28 + 42 + 88 + 79 + 73 = 1030$

Step 2: Divide by $N$

$$\mu = 1030/N = 1030/15 \approx 68.7$$

The mean is 68.7.

Finding standard deviation

Follow the steps discussed above to calculate the standard deviation.

It will look something like this:

$$\sigma = \sqrt{\frac{(72 - 68.7)^2 + (86 - 68.7)^2 + \dots + (73 - 68.7)^2}{15}} \approx 22.4$$

The standard deviation is 22.4.
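Both figures can be double-checked with Python's standard `statistics` module. The sketch below assumes the population standard deviation (`pstdev`, which divides by $N$), as in the formula used in this article.

```python
from statistics import fmean, pstdev

scores = [72, 86, 56, 92, 78, 94, 32, 44, 66, 100, 28, 42, 88, 79, 73]

total = sum(scores)
mu = fmean(scores)       # mean of the 15 scores
sigma = pstdev(scores)   # population standard deviation (divides by N)

print(total)             # 1030
print(round(mu, 1))      # 68.7
print(round(sigma, 1))   # 22.4
```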

Finding z-score

We can now plug these values into the formula for the z-score.

$$z = \frac{X - \mu}{\sigma}$$

For Jack:

$$z = \frac{72 - 68.7}{22.4} \approx 0.147$$

In simpler words, Jack’s score is 0.147 standard deviations above the mean.

We can repeat the process for all the students. The updated table below shows the z-score of each student as well:

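| Student | Test score (out of 100) | z-score |
|---|---|---|
| Jack | 72 | 0.147 |
| Jim | 86 | 0.772 |
| Gabe | 56 | -0.567 |
| Bill | 92 | 1.040 |
| Alice | 78 | 0.415 |
| Veronica | 94 | 1.129 |
| Angelica | 32 | -1.638 |
| Matt | 44 | -1.103 |
| Thomas | 66 | -0.121 |
| Dice | 100 | 1.397 |
| Donald | 28 | -1.817 |
| Rice | 42 | -1.192 |
| Jones | 88 | 0.862 |
| Chris | 79 | 0.460 |
| Liam | 73 | 0.192 |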
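Repeating the calculation by hand for every student is tedious, so here is a short Python sketch that does it in one pass. It assumes the rounded values $\mu = 68.7$ and $\sigma = 22.4$ from the steps above; the variable names are illustrative.

```python
scores = {
    "Jack": 72, "Jim": 86, "Gabe": 56, "Bill": 92, "Alice": 78,
    "Veronica": 94, "Angelica": 32, "Matt": 44, "Thomas": 66, "Dice": 100,
    "Donald": 28, "Rice": 42, "Jones": 88, "Chris": 79, "Liam": 73,
}

MU = 68.7      # rounded mean from the worked example
SIGMA = 22.4   # rounded standard deviation from the worked example

# Apply z = (X - mu) / sigma to every score.
z_scores = {name: (score - MU) / SIGMA for name, score in scores.items()}

for name, z in z_scores.items():
    print(f"{name:<10}{scores[name]:>5}{z:>9.3f}")
```

Using the unrounded mean and standard deviation instead would shift the results slightly in the third decimal place.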

Results

As the table above shows, Angelica, Matt, Donald, and Rice scored more than 1 standard deviation below the mean (a z-score below -1). Hence, they failed the test.
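The professor's rule can also be expressed as a simple filter. The sketch below computes the mean and population standard deviation at full precision with the standard `statistics` module, then keeps every student whose z-score is below -1; the variable names are illustrative.

```python
from statistics import fmean, pstdev

scores = {
    "Jack": 72, "Jim": 86, "Gabe": 56, "Bill": 92, "Alice": 78,
    "Veronica": 94, "Angelica": 32, "Matt": 44, "Thomas": 66, "Dice": 100,
    "Donald": 28, "Rice": 42, "Jones": 88, "Chris": 79, "Liam": 73,
}

values = list(scores.values())
mu = fmean(values)
sigma = pstdev(values)  # population standard deviation (divides by N)

# A student fails when the score is more than 1 standard deviation below the mean.
failed = [name for name, score in scores.items() if (score - mu) / sigma < -1]
print(failed)  # ['Angelica', 'Matt', 'Donald', 'Rice']
```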

Other areas of usage

The z-score follows the same pattern of calculation in statistical inference as well. In statistical inference, we need to validate whether a hypothesis generalizes to the entire population or is only applicable to the sample data. For such purposes, statisticians carry out hypothesis testing, which often involves standardizing data and calculating z-scores.

Similarly, when comparing two datasets measured on different scales, we can use the z-score as a standardized metric.
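To make the ACT and SAT comparison from earlier concrete, here is a small Python sketch. The means, standard deviations, and applicant scores are hypothetical numbers invented for illustration; real test statistics differ from year to year.

```python
def z_score(x, mu, sigma):
    return (x - mu) / sigma


# Hypothetical summary statistics for each test (not real figures).
ACT_MEAN, ACT_STD = 21.0, 5.0
SAT_MEAN, SAT_STD = 1050.0, 200.0

# Hypothetical applicants: one took the ACT, the other took the SAT.
z_act = z_score(28, ACT_MEAN, ACT_STD)     # (28 - 21) / 5
z_sat = z_score(1300, SAT_MEAN, SAT_STD)   # (1300 - 1050) / 200

better = "ACT applicant" if z_act > z_sat else "SAT applicant"
print(z_act, z_sat, better)
```

Under these assumed numbers, the ACT applicant sits further above their test's mean, even though the raw scores cannot be compared directly.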


Copyright ©2024 Educative, Inc. All rights reserved