When working with data, your analysis and insights are only as good as the data you use. If you perform data analysis on dirty data, your organization can't make efficient and effective decisions with it. Data cleaning is a critical part of data management that lets you validate that you're working with high-quality data.
Data cleaning includes more than just fixing spelling or syntax errors. It's a fundamental part of data science and analytics, and an important step in any machine learning workflow. Today, we'll learn more about data cleaning, its benefits, issues that can arise with your data, and next steps for your learning.
Data cleaning, or data cleansing, is the process of correcting or removing incorrect, incomplete, or duplicate data within a dataset. It should be the first step in your workflow: when you work with large datasets and combine various data sources, there's a strong possibility that you'll duplicate or mislabel data. Inaccurate or incorrect data degrades your dataset's quality and makes your algorithms and outcomes unreliable.
Data cleaning differs from data transformation because you’re actually removing data that doesn’t belong in your dataset. With data transformation, you’re changing your data to a different format or structure. Data transformation processes are sometimes referred to as data wrangling or data munging. The data cleaning process is what we’ll focus on today.
So, how do I know if my data is clean?
To determine data quality, you can study its features and weigh them according to what’s important to your organization and your project.
There are five main features to look for when evaluating your data: validity, accuracy, completeness, consistency, and uniformity.
Now that we know how to recognize high-quality data, let's dive deeper into the data cleaning process, why it's important, and how to do it effectively.
Let’s discuss some cleaning steps you can take to ensure you’re working with high-quality data. Data scientists spend a lot of time cleaning data because once their data is clean, it’s much easier to perform data analysis and build models.
First, we’ll discuss some issues you could experience with your data and what to do about them.
It's common for large datasets to have missing values. Maybe the person recording the data forgot to input them, or maybe the collection of those variables started late in the data collection process. Either way, missing data should be managed before you work with a dataset.
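In pandas, you can either drop the rows that contain missing values or fill them in. Here's a minimal sketch, assuming a hypothetical DataFrame with an 'age' column:

import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 31, np.nan, 42]})

# Option 1: drop any row that contains a missing value
df_dropped = df.dropna()

# Option 2: fill missing values with a summary statistic, such as the median
df_filled = df.fillna({'age': df['age'].median()})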
Outliers hold essential information about your data, but they can also pull your focus away from the main group. It's a good idea to examine your data both with and without outliers. If you decide to keep them, choose a robust method that can handle them. If you decide against using them, you can simply drop them.
You can also filter out unwanted outliers by using this method:
import numpy as np

# Get the 98th and 2nd percentiles as the limits of our outliers
upper_limit = np.percentile(train_df['target'].values, 98)
lower_limit = np.percentile(train_df['target'].values, 2)

# Clip any values outside the limits back to the limits
train_df.loc[train_df['target'] > upper_limit, 'target'] = upper_limit
train_df.loc[train_df['target'] < lower_limit, 'target'] = lower_limit
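This approach clips extreme values to the percentile limits (often called winsorizing) rather than deleting rows, so you keep the full dataset while limiting the influence of extreme values.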
The data in your feature variables should be standardized, which makes examining and modeling your data a lot easier. For example, let's look at two values, "dog" and "cat", in the "animal" variable. When you collect data, you may receive variations you didn't anticipate, such as "DOG", "Dog", and the typo "dof" for "dog", or "CAT", "Cat", and "cart" for "cat".
If we converted this feature variable into numeric categories as-is, we wouldn't get the 0 and 1 values we want. We'd get something more like this:
{'dog': 0,
 'cat': 1,
 'DOG': 2,
 'CAT': 3,
 'Dog': 4,
 'Cat': 5,
 'dof': 6,
 'cart': 7}
To effectively deal with the capitalization issues and help standardize your data, you can do something like this:
# Make the string lowercase
s.lower()

# Make the first letter capitalized
s.capitalize()
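These methods work on a single Python string. If your values live in a pandas column, a minimal sketch (reusing the "animal" column from above) applies the same fix to every value at once:

# Lowercase every value in the column in one pass
pd_dataframe['animal'] = pd_dataframe['animal'].str.lower()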
If there's an issue with typos, you can map them to the correct values:

# Map known typos to their correct values; replace() leaves
# all other values (like 'dog' and 'cat') untouched
value_map = {'dof': 'dog', 'cart': 'cat'}
pd_dataframe['animal'] = pd_dataframe['animal'].replace(value_map)
Note: Another way to deal with typos is to run a spelling and grammar check in Microsoft Excel.
Sometimes you may have irrelevant data that should be removed. Let's say you want to predict the sales of a magazine. You're examining a dataset of magazines ordered from Amazon over the past year, and you notice a feature variable called "font-type" that notes which font was used in each magazine.
This feature is pretty irrelevant, and it probably wouldn't help you predict the sales of a magazine. You could drop it like this:

# Drop the irrelevant column; drop() returns a new DataFrame by default
df = df.drop('feature_variable_name', axis=1)
Removing those unwanted observations not only makes data exploration easier but also helps train your machine learning model.
Dirty data includes any data points that are wrong or simply shouldn't be there. Duplicates occur when data points are repeated in your dataset. If you have a lot of duplicates, they can throw off the training of your machine learning model.

To handle dirty data, you can either drop the bad data points or replace them (for example, by converting incorrect data points into correct ones). To handle duplicates, you can simply drop them from your data, as shown below.
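In pandas, dropping duplicates is a one-liner; the subset column names below are hypothetical:

# Drop rows that are exact duplicates, keeping the first occurrence
df = df.drop_duplicates()

# Or treat rows as duplicates when they match on selected columns only
df = df.drop_duplicates(subset=['customer_id', 'order_date'], keep='first')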
You obviously can't use blank data for data analysis. Blank data is a major issue for analysts because it weakens the quality of the data. Ideally, you should catch blank data in the data collection phase, but you can also write a short program to remove it for you.
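One way to do this in pandas, assuming the blanks appear as empty or whitespace-only strings, is to convert them to NaN and then drop them:

import numpy as np

# Convert empty and whitespace-only strings to NaN...
df = df.replace(r'^\s*$', np.nan, regex=True)

# ...then drop the rows that now contain missing values
df = df.dropna()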
White space is a small but common issue in many datasets. A TRIM function will help you eliminate it.
Note: The TRIM function is categorized under Excel text functions. It helps remove extra spaces in data. You can use the =TRIM(text) formula.
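If you're working in pandas rather than Excel, the .str.strip() accessor plays the same role; the 'name' column here is hypothetical:

# Remove leading and trailing whitespace from a text column
df['name'] = df['name'].str.strip()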
Sometimes, when exporting data, numeric values get converted into text. Excel's VALUE function is a great way to fix this issue.
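The pandas equivalent is pd.to_numeric; this sketch assumes a hypothetical 'price' column, and errors='coerce' turns any string that can't be parsed into NaN instead of raising an error:

import pandas as pd

# Convert text to numbers; unparseable strings become NaN
df['price'] = pd.to_numeric(df['price'], errors='coerce')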
The data cleansing process sounds time-consuming, but it makes your data easier to work with and allows you to get the most out of it. Benefits of data cleaning include greater efficiency, more reliable analysis and models, and confidence that you're making decisions with high-quality data.
Data cleaning is an important part of your organization's data management workflow. Now that you've learned more about this process, you're ready to move on to more advanced concepts within machine learning.
To get up to speed with the modern techniques in machine learning, check out Educative’s Learning Path Become a Machine Learning Engineer. In this learning path, you’ll explore essential machine learning techniques to help you stand out from the competition. By the end, you’ll have job-ready skills in data pipeline creation, model deployment, and inference.
Happy learning!