Statistical Methods

Learn how to apply basic statistical operations in pandas.

Importance of statistics in data analysis

Statistics plays a crucial role in data analysis, providing methods to summarize, organize, and make inferences from our data. Applying statistical methods lets us draw meaningful insights during data exploration and supports data-driven decisions. The pandas library provides a range of basic statistical methods for understanding our data, which we’ll explore using the credit card dataset.

Central tendency

Central tendency is a statistical measure describing a dataset's center or typical value. There are three common measures of central tendency: mean, median, and mode. For example, we can find the mean of the Rating column, the median of the Income column, and the mode of the Cards column with the mean(), median(), and mode() methods, respectively.

# Mean
mean_val = df['Rating'].mean()
print(f'The mean Rating value is {mean_val}')
# Median
median_val = df['Income'].median()
print(f'The median Income value is {median_val}')
# Mode (mode() returns a Series because several values can tie)
mode_val = df['Cards'].mode().tolist()
print(f'The mode of Cards is {mode_val}')

Notice that mode() returns two values, 0 and 2. This means the Cards column is bimodal: two distinct values are tied for the highest frequency and occur more often than any other value.
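To see why mode() can return more than one value, here is a minimal sketch on a small, made-up Series. The values below are hypothetical and are not taken from the credit card dataset; they are chosen only so that two values tie for the highest frequency.

```python
import pandas as pd

# Hypothetical card counts, invented for illustration only
cards = pd.Series([0, 0, 2, 2, 1, 3])

# mode() returns a Series holding every value tied for the highest frequency
modes = cards.mode()
mode_list = modes.tolist()
print(f'Modes: {mode_list}')

# value_counts() shows the frequencies behind that result
counts = cards.value_counts()
print(counts)
```

Because the result is a Series, converting it with tolist() or selecting a single entry with iloc[0] are two simple ways to work with it, depending on whether ties matter for the analysis.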

Variability

Data variability refers to the degree to which the values in a dataset differ from the central tendency. It measures how spread out the data is, and it’s an important aspect of understanding the distribution of a dataset. There are various measures of variability, as shown in the examples below:

  • Range: Measures the difference between the highest and lowest values and is calculated with max() and min().

# Calculate range of Limit column
range_val = df['Limit'].max() - df['Limit'].min()
print(f'The range of the Limit column is {range_val}')

  • Variance: Measures the average squared deviation of the values from the mean and is calculated with var(), which returns the sample variance (ddof=1) by default.

# Calculate variance of Income column
variance = df['Income'].var()
print(f'The variance of the Income column is {round(variance, 2)}')
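As a rough illustration of what var() computes, the sketch below compares it with a manual calculation on a small, made-up Series. The numbers are hypothetical and not from the credit card dataset. It assumes the pandas default of ddof=1, which divides by n - 1 (sample variance); passing ddof=0 divides by n instead (population variance).

```python
import pandas as pd

# Hypothetical income values, invented for illustration only
income = pd.Series([10.0, 20.0, 30.0, 40.0])

# Sum of squared deviations from the mean
squared_dev = ((income - income.mean()) ** 2).sum()

# Manual sample variance: divide by n - 1
manual_sample_var = squared_dev / (len(income) - 1)

# pandas equivalents
sample_var = income.var()            # default ddof=1
population_var = income.var(ddof=0)  # divide by n instead

print(f'Manual sample variance: {manual_sample_var:.2f}')
print(f'var():                  {sample_var:.2f}')
print(f'var(ddof=0):            {population_var:.2f}')
```

The two sample-variance figures should agree, while the population variance is slightly smaller because it divides by a larger number.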
...