Data Scrubbing Operation: Removing Variables

This lesson will introduce you to ways of removing redundant or unhelpful data variables.

Quick overview

Preparing data for further processing usually starts by removing variables that aren’t compatible with the chosen algorithm or that are deemed less relevant to your target output. Deciding which variables to remove typically relies on exploratory data analysis and domain knowledge.

For exploratory data analysis, it is often helpful to start by checking the data type of each variable (e.g., string, Boolean, or integer) and the correlation between variables. Domain knowledge, meanwhile, is useful for spotting duplicate variables, such as country and country code, and for eliminating less relevant variables, such as latitude and longitude.
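The two checks described above can be sketched with pandas as follows. This is a minimal illustration rather than a prescribed workflow: the dataset and its column names (country, country_code, price, rooms, is_available) are hypothetical and chosen only to show a mix of data types.

```python
import pandas as pd

# Hypothetical toy dataset; the column names are illustrative only.
df = pd.DataFrame({
    "country": ["France", "Japan", "Brazil", "Kenya"],
    "country_code": ["FR", "JP", "BR", "KE"],
    "price": [100.0, 150.0, 200.0, 250.0],
    "rooms": [1, 2, 3, 4],
    "is_available": [True, False, True, True],
})

# Data type of each variable (text, float, integer, Boolean).
print(df.dtypes)

# Keep only the numeric variables, then compute their pairwise correlation.
numeric_df = df.select_dtypes(include="number")
correlations = numeric_df.corr()
print(correlations)
```

Here, country and country_code carry the same information, which is the kind of duplication that domain knowledge helps you spot, while the correlation matrix highlights numeric variables that move together.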

Note: In Python, variables can be removed from the dataframe using the del statement, followed by the variable name of the dataframe and the title of the column you wish to remove. The column title should be placed inside quotation marks and square brackets, as shown here:

del df['latitude']
del df['longitude']

Note: This code example, like other changes made inside your notebook, won’t alter the source file of the dataset. You can restore variables removed from the development environment by deleting the relevant line(s) of code and rerunning the notebook so that the dataset is loaded again. In fact, it’s common to reverse the removal of features when testing the model with different combinations of variables.
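The behavior described in this note can be checked with a short sketch. It assumes the dataset lives in a CSV file; the file name listings.csv, its contents, and the variable names are hypothetical and created in a temporary directory only for the demonstration. The sketch also uses the pandas drop() method, which is not part of the lesson's example, as one possible way to try different variable combinations without changing the dataframe you loaded.

```python
import os
import tempfile

import pandas as pd

# Hypothetical source file, written to a temporary directory for this demo.
csv_text = (
    "city,latitude,longitude,price\n"
    "Paris,48.86,2.35,100\n"
    "Tokyo,35.68,139.69,150\n"
)
path = os.path.join(tempfile.mkdtemp(), "listings.csv")
with open(path, "w") as f:
    f.write(csv_text)

df = pd.read_csv(path)

# Remove the columns from the in-memory dataframe only.
del df["latitude"]
del df["longitude"]

# The file on disk still contains every column.
with open(path) as f:
    source_unchanged = f.read() == csv_text

# Re-reading the file (without the del lines) restores the removed variables.
restored = pd.read_csv(path)

# drop() returns a new dataframe, so different combinations of variables
# can be tried without modifying the restored dataframe.
without_price = restored.drop(columns=["price"])

print(source_unchanged)
print(list(df.columns))
print(list(restored.columns))
print(list(without_price.columns))
```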

