Solution: Clean the Data
This lesson gives a detailed review of the solution to the challenge from the previous lesson.
Solution
import pandas as pd
def clean_data(df):
    df = df.dropna()  # dropping all rows with null values
    # A list of all columns on which outliers need to be removed
    out_list = ['median_house_value', 'median_income', 'housing_median_age']
    # computing 1st & 3rd quartiles (restricted to the numeric columns we need)
    quantiles_df = df[out_list].quantile([0.25, 0.75])
    for out in out_list:  # traversing through the list
        Q1 = quantiles_df[out][0.25]  # retrieving value of 1st quartile
        Q3 = quantiles_df[out][0.75]  # retrieving value of 3rd quartile
        iqr = Q3 - Q1  # computing the interquartile range
        lower_bound = Q1 - (iqr * 1.5)  # computing lower bound
        upper_bound = Q3 + (iqr * 1.5)  # computing upper bound
        # Assigning through df.loc so the change lands in df itself, not in a temporary copy
        df.loc[df[out] < lower_bound, out] = lower_bound  # assign outliers to lower bound
        df.loc[df[out] > upper_bound, out] = upper_bound  # assign outliers to upper bound
    return df

# Test Code
df = pd.read_csv('housing.csv')
df_res = clean_data(df.copy())
print(df_res)
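To see the two steps in isolation, here is a minimal sketch on a toy DataFrame. The values are made up for illustration and do not come from housing.csv. It also uses Series.clip as a compact alternative to the two masked assignments in the solution, which should give the same result.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from housing.csv)
toy = pd.DataFrame({'median_income': [1.0, 2.0, None, 3.0, 4.0, 100.0]})

# Step 1: drop rows that contain a null value
clean = toy.dropna()

# Step 2: compute the quartiles and the IQR-based bounds
q1 = clean['median_income'].quantile(0.25)
q3 = clean['median_income'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Step 3: cap values outside the bounds (clip replaces the two masked assignments)
capped = clean['median_income'].clip(lower=lower_bound, upper=upper_bound)

print(lower_bound, upper_bound)  # -1.0 7.0
print(capped.tolist())           # [1.0, 2.0, 3.0, 4.0, 7.0]
```

The row with the missing value is removed, and the outlier 100.0 is pulled down to the upper bound of 7.0. The other values fall inside the bounds and are left unchanged.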
Explanation
A function, clean_data, is declared with df passed to it as a parameter.
On line 3, the dropna() function of the DataFrame is called, which automatically finds and removes all NaN-containing ...