Statistical Methods
Learn how to apply basic statistical operations in pandas.
Importance of statistics in data analysis
Statistics plays a crucial role in data analysis, providing methods to summarize, organize, and make inferences from our data. Applying statistical methods lets us draw meaningful insights during data exploration and supports data-driven decisions. The pandas library contains a range of basic statistical methods for gaining a strong understanding of our data, which we’ll explore using the credit card dataset.
Central tendency
Central tendency is a statistical measure describing a dataset's center or typical value. There are three central tendency measures: mean, median, and mode. For example, we can find the mean of the Rating column, the median of the Income column, and the mode of the Cards column with the mean(), median(), and mode() methods, respectively.
# Mean
mean_val = df['Rating'].mean()
print(f'The mean Rating value is {mean_val}')

# Median
median_val = df['Income'].median()
print(f'The median Income value is {median_val}')

# Mode
mode_val = df['Cards'].mode()
print(f'The mode of Cards is {mode_val}')
Notice that mode() returns two values, 0 and 2. This implies that the Cards column is bimodal: two distinct values tie for the highest frequency, each occurring more often than any other value.
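As a minimal sketch of how these three methods behave, the snippet below uses a small made-up DataFrame named toy (its values are hypothetical and not taken from the credit card dataset). It illustrates that mean() and median() each return a single number, while mode() returns a Series, since more than one value can tie for most frequent.

```python
import pandas as pd

# Hypothetical toy data, for illustration only (not the credit card dataset)
toy = pd.DataFrame({
    'Rating': [300, 450, 500, 350, 400, 400],
    'Income': [20.0, 35.0, 50.0, 120.0, 40.0, 30.0],
    'Cards': [2, 3, 2, 3, 1, 4],
})

mean_val = toy['Rating'].mean()      # arithmetic average of the column
median_val = toy['Income'].median()  # middle value; less affected by the 120.0 outlier
mode_vals = toy['Cards'].mode()      # Series of all values tied for most frequent

print(f'Mean Rating: {mean_val}')
print(f'Median Income: {median_val}')
print(f'Modes of Cards: {mode_vals.tolist()}')

# With a single mode, the result is still a Series (of length one)
single_mode = pd.Series([1, 2, 2]).mode()
print(f'Single mode: {single_mode.iloc[0]}')
```

When a Series returned by mode() is printed directly, the left-hand column shows index labels (starting at 0) and the right-hand column shows the mode values. Calling .tolist() or .iloc[0] extracts the values themselves.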
Variability
Data variability refers to the degree to which the values in a dataset differ from the central tendency. It measures how spread out the data is, and it’s an important aspect of understanding the distribution of a dataset. There are various measures of variability, as shown in the examples below:
Range: Measures the difference between the highest and lowest values and is calculated with max() and min().
# Calculate range of Limit column
range_val = df['Limit'].max() - df['Limit'].min()
print(f'The range of the Limit column is {range_val}')
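Because the range depends only on the two extreme values, a single unusual observation can change it substantially. The sketch below illustrates this with a hypothetical Series named limits (made-up numbers, not the Limit column of the credit card dataset).

```python
import pandas as pd

# Hypothetical credit limits, for illustration only
limits = pd.Series([1000, 2500, 4000, 3200, 3800])
range_val = limits.max() - limits.min()
print(f'Range without outlier: {range_val}')

# Adding one extreme value changes the range, although the other values are unchanged
limits_with_outlier = pd.concat([limits, pd.Series([20000])], ignore_index=True)
range_with_outlier = limits_with_outlier.max() - limits_with_outlier.min()
print(f'Range with outlier: {range_with_outlier}')
```

For this reason, the range is usually read alongside a measure that uses every observation, such as the variance.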
Variance: Measures the degree of spread about the mean value of the data and is calculated with var().
# Calculate variance of Income column
variance = df['Income'].var()
print(f'The variance of the Income column is {round(variance, 2)}')
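One detail worth knowing is that var() in pandas computes the sample variance by default (ddof=1, dividing by n - 1), while passing ddof=0 gives the population variance (dividing by n). The sketch below checks both against a manual calculation, using a hypothetical Series named incomes with made-up values.

```python
import pandas as pd

# Hypothetical income values, for illustration only
incomes = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_var = incomes.var()            # default ddof=1: divides by n - 1
population_var = incomes.var(ddof=0)  # ddof=0: divides by n

# Manual sample variance: squared deviations from the mean, summed, over n - 1
squared_deviations = (incomes - incomes.mean()) ** 2
manual_sample_var = squared_deviations.sum() / (len(incomes) - 1)

print(f'Sample variance: {round(sample_var, 2)}')
print(f'Population variance: {round(population_var, 2)}')
print(f'Manual sample variance: {round(manual_sample_var, 2)}')
```

The two versions converge as the number of observations grows, so the choice matters most for small datasets.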