Data Scrubbing Operation: Removing Variables
This lesson will introduce you to ways of removing redundant or unhelpful data variables.
We'll cover the following
Quick overview
Preparing data for further processing generally starts by removing variables that aren’t compatible with the chosen algorithm or that are deemed less relevant to your target output. Which variables to remove is usually determined through exploratory data analysis and domain knowledge.
In exploratory data analysis, it is often helpful to start by checking the data type of each variable (string, Boolean, integer, etc.) and the correlation between variables. Domain knowledge, meanwhile, is useful for spotting duplicate variables, such as country and country code, and for eliminating less relevant variables, such as latitude and longitude.
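The checks described above can be sketched with pandas as follows. This is a minimal illustration, not part of the lesson's dataset: the dataframe and its column names (country, country_code, latitude, longitude, price) are hypothetical and chosen only to mirror the examples in the text.

```python
import pandas as pd

# Hypothetical toy dataset; the column names are illustrative only.
df = pd.DataFrame({
    "country": ["France", "Japan", "Brazil", "Kenya"],
    "country_code": ["FR", "JP", "BR", "KE"],
    "latitude": [46.2, 36.2, -14.2, -0.02],
    "longitude": [2.2, 138.3, -51.9, 37.9],
    "price": [100, 250, 80, 60],
})

# Data type of each variable.
print(df.dtypes)

# Pairwise correlation, computed on the numeric variables only.
corr = df.select_dtypes(include="number").corr()
print(corr)

# If every country maps to exactly one country code, the two columns
# carry the same information and one of them is a candidate for removal.
is_duplicate = df.groupby("country")["country_code"].nunique().eq(1).all()
print(is_duplicate)
```

Whether a strongly correlated or duplicate variable should actually be removed still depends on the algorithm and on domain knowledge; the code only surfaces candidates.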
Note: In Python, variables can be removed from the dataframe using the del statement, followed by the variable name of the dataframe and the title of the column you wish to remove. The column title should be wrapped in quotation marks and placed inside square brackets, as shown here:
del df['latitude']
del df['longitude']
Note: This code example, like any other change made inside your notebook, won’t affect or alter the source file of the dataset. You can restore variables removed from the development environment by deleting the relevant line(s) of code and re-running the notebook from the point where the dataset is loaded. In fact, it’s common to reverse the removal of features when testing the model with different combinations of variables.
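A small sketch can make this concrete. It assumes a hypothetical CSV file (listings.csv, written to a temporary folder here so the example is self-contained) and shows that deleting columns from the dataframe leaves the file on disk unchanged. It also shows the pandas drop method, which is not used in this lesson but may be convenient when trying different combinations of variables, since it returns a new dataframe instead of modifying the original.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the dataset's source file.
path = os.path.join(tempfile.mkdtemp(), "listings.csv")
pd.DataFrame({
    "price": [100, 250, 80],
    "latitude": [46.2, 36.2, -14.2],
    "longitude": [2.2, 138.3, -51.9],
}).to_csv(path, index=False)

# Load the data and remove two columns from the in-memory dataframe.
df = pd.read_csv(path)
del df["latitude"]
del df["longitude"]
columns_in_memory = list(df.columns)

# The file itself was never written to, so reloading it restores every column.
columns_on_disk = list(pd.read_csv(path).columns)

# Alternative (an assumption, not the lesson's method): drop() returns a new
# dataframe and leaves the original untouched.
full = pd.read_csv(path)
without_coords = full.drop(columns=["latitude", "longitude"])

print(columns_in_memory)
print(columns_on_disk)
print(list(without_coords.columns))
```

Keeping the full dataframe alongside reduced copies is one way to compare models trained on different feature sets without reloading the file each time.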