Want hands-on practice with data cleaning and visualization? Try the Time Series Analysis and Visualization Using Python and Plotly project.
When working with data, your insights and analysis are only as good as the data you use. Dirty data leads to unreliable conclusions, making it impossible for organizations to make efficient and effective decisions.
That’s where data cleaning comes in. This critical part of data management ensures your datasets are accurate, consistent, and ready for analysis. It’s about more than fixing typos or syntax errors—data cleaning is foundational to data science and machine learning, playing a key role in delivering trustworthy results.
In this blog, we’ll explore:
Why data cleaning is essential
Common challenges in messy datasets
Step-by-step techniques for cleaning data with Python and Excel
Let’s dive into the details of this indispensable skill and see how it can unlock the full potential of your analysis.
Data cleaning, or data cleansing, is the important process of correcting or removing incorrect, incomplete, or duplicate data within a dataset. Data cleaning should be the first step in your workflow. When working with large datasets and combining various data sources, there’s a high likelihood of duplicating or mislabeling data. Inaccurate or low-quality data undermines reliability, making algorithms and outcomes untrustworthy.
Data cleaning differs from data transformation because, with data cleaning, you’re correcting or removing data that doesn’t belong in your dataset. On the other hand, with data transformation, you change your data to a different format or structure to prepare it for analysis. Data transformation processes are sometimes referred to as data wrangling or data munging.
Today, we’ll focus on data cleaning. To determine data quality, you can study its features and weigh them according to what’s important to your organization and project.
Always look for the following five main features when evaluating your data:
Consistency: Is your data consistent across datasets?
Accuracy: Is your data close to the true values?
Completeness: Does your data include all required information?
Validity: Does your data correspond with business rules and/or restrictions?
Uniformity: Is your data specified using consistent units of measurement?
Now that we’ve learned how to recognize high-quality data, let’s explore the data cleaning process, why it’s important, and how to do it effectively.
Let’s discuss some cleaning steps to ensure you work with high-quality data. Data scientists spend a lot of time cleaning data because it’s much easier to analyze and build models once their data is clean. This section will explore common data issues, why they occur, and practical steps in Python and Microsoft Excel to resolve them.
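Before fixing anything, it helps to profile the dataset so you know which of these issues you are dealing with. Here is a minimal sketch of a first-pass check; data.csv is a placeholder file name, and the example dataset used in the rest of this blog is built in the next code snippet:

import pandas as pd

# Load your data; 'data.csv' is a placeholder file name
df = pd.read_csv('data.csv')

# Quick profile: shape, types, missing values, duplicates, and summary statistics
print(df.shape)
print(df.dtypes)
print(df.isna().sum())        # Missing values per column
print(df.duplicated().sum())  # Fully duplicated rows
print(df.describe())          # Spot suspicious ranges and outliers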
It’s common for large datasets to have some missing values. Maybe the person recording the data forgot to input them, or maybe they began collecting those missing data variables late into the data collection process. No matter what, missing data should be managed before working with datasets. Missing data can be managed by:
Removing rows or columns with excessive missing values.
Imputing values using statistical methods like the mean, median, or mode.
The following Python code builds the example dataset used throughout this blog and then replaces the NaN values in the Age and Salary columns with each column’s mean to handle missing data.
import pandas as pd
import numpy as np

# Initial dataset
data = {
    'Name': [' Alice ', 'bob', 'CHARLIE', 'Alice ', 'dave '],
    'Age': [25, 30, np.nan, 25, 35],
    'Salary': [50000, 70000, np.nan, 50000, 1000000],  # 1M is an outlier
    'Department': ['hr', 'sales', 'SALES', 'dvelopmnt', 'HR'],  # Typos and inconsistent casing
    'PerformanceScore': [85, 92, 40, 88, 300],  # Outlier in performance score
    'Hobby': ['Reading', 'Traveling', 'Gaming', 'Cooking', 'Hiking'],
    'JoinDate': ['2022-01-15', '2020-06-01', '2021-07-01', '2021-03-20', '2019-12-01']
}
df = pd.DataFrame(data)
print("Initial Dataset:\n", df)

# Replace missing values with the column mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
print("After Handling Missing Data:\n", df)
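The mean is only one way to impute. If you instead want to drop sparse rows or columns, or impute with the median or mode, here is a minimal sketch; the 50% threshold is an arbitrary choice for illustration, not part of the original example:

# Keep only rows that have at least half of their values present
df_rows_dropped = df.dropna(thresh=int(df.shape[1] * 0.5))

# Keep only columns that have at least half of their values present
df_cols_dropped = df.dropna(axis=1, thresh=int(df.shape[0] * 0.5))

# Median or mode imputation instead of the mean, e.g.:
# df['Age'] = df['Age'].fillna(df['Age'].median())
# df['Department'] = df['Department'].fillna(df['Department'].mode()[0])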
In Excel, use =IF(ISNA(A2), AVERAGE(A:A), A2) to replace an #N/A value in cell A2 with the average of column A; if your missing values appear as blank cells instead of #N/A errors, swap ISNA for ISBLANK. The IF function, written as =IF(condition, value1, value2), evaluates a condition and returns one of two values: value1 if the condition is true and value2 if it is false.
Outliers can hold essential information about your data, but they can also skew your analysis. It’s a good idea to examine your data both with and without outliers. If you decide to keep them, choose a robust method that can handle them; if not, you can simply drop them.
In Python, you can filter out unwanted outliers by using the following code as a template:
# Define outlier thresholds
upper_salary_limit = np.percentile(df['Salary'], 98)
upper_perf_limit = np.percentile(df['PerformanceScore'], 98)

# Cap Salary and PerformanceScore outliers
df['Salary'] = np.where(df['Salary'] > upper_salary_limit, upper_salary_limit, df['Salary'])
df['PerformanceScore'] = np.where(df['PerformanceScore'] > upper_perf_limit, upper_perf_limit, df['PerformanceScore'])
print("After Filtering Outliers:\n", df)
In the code above, the outliers in the Salary and PerformanceScore columns of the df DataFrame are capped at the 98th-percentile thresholds (upper_salary_limit and upper_perf_limit) calculated from the data: any value above a column’s threshold is replaced with that threshold. The np.where function returns elements based on a condition, allowing you to efficiently replace or modify values in an array or DataFrame.
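Capping is not the only option. If you prefer to drop outliers instead, a common rule of thumb is the interquartile range (IQR) fence; the following is a minimal sketch applied to the Salary column, not part of the original example:

# Drop rows whose Salary falls outside the 1.5 * IQR fences
q1 = df['Salary'].quantile(0.25)
q3 = df['Salary'].quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
df_no_outliers = df[(df['Salary'] >= lower_fence) & (df['Salary'] <= upper_fence)]
print("After Dropping Salary Outliers:\n", df_no_outliers)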
In Excel, you can use formulas to identify and cap outliers:
Calculate the percentile thresholds: Use the PERCENTILE function to calculate the 98th and 2nd percentiles of the data range. For example:
=PERCENTILE(A1:A100, 0.98) finds the 98th percentile of the range A1:A100.
=PERCENTILE(A1:A100, 0.02) finds the 2nd percentile of the range A1:A100.
Cap the outliers: Use the IF function to replace values above or below these thresholds with the threshold values.
For example, this formula returns the capped value for cell A1, based on the 98th and 2nd percentile thresholds stored in cells B1 and C1, and is entered in a separate column:
=IF(A1 > $B$1, $B$1, IF(A1 < $C$1, $C$1, A1))
Here:
A1 is the cell containing the value to check.
$B$1 is the cell containing the 98th percentile threshold.
$C$1 is the cell containing the 2nd percentile threshold.
If A1 exceeds the 98th percentile, it is capped at $B$1. If A1 is less than the 2nd percentile, it is capped at $C$1. Otherwise, it retains its original value.
Tip: Because we’ve used A1 (without dollar signs) instead of $A$1 in the formula, the reference is relative. If you copy or drag the formula to another cell, Excel automatically updates the cell reference to match the new location. For example, you can paste this formula in cell D1 and drag it down to apply it to the entire column.
Another common issue is values stored with the wrong data type, such as dates kept as plain text. In Python, you can convert the JoinDate column to a proper datetime type, coercing any invalid entries to NaT:

# Convert JoinDate to datetime; errors='coerce' turns unparseable dates into NaT
df['JoinDate'] = pd.to_datetime(df['JoinDate'], errors='coerce')
print("After Fixing Conversion Errors:\n", df['JoinDate'])
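Dates aren’t the only type that ends up stored as text. For numbers kept as strings, pd.to_numeric behaves the same way; here is a minimal sketch using a hypothetical Revenue column that is not part of the example dataset:

# Hypothetical example: numbers stored as text, with one unparseable entry
raw = pd.DataFrame({'Revenue': ['1200', '980', '123abc', '1500']})

# errors='coerce' converts unparseable strings to NaN instead of raising an error
raw['Revenue'] = pd.to_numeric(raw['Revenue'], errors='coerce')
print(raw)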
In Excel, use =VALUE(A2) to convert text that represents a number into an actual numeric value. However, the function will return an error if the text contains non-numeric characters (e.g., “123abc”).
The data in your feature variables should be standardized. It makes examining and modeling your data a lot easier. Inconsistent data values can make your dataset messy and unreliable, leading to errors during analysis or modeling. Variations in capitalization, typos, or inconsistent formatting can cause similar entries to be treated as entirely different categories. Let’s explore common data standardization challenges and the techniques to address them.
Data values often vary in capitalization, leading to inconsistencies. For instance, “DOG” and “dog” might be treated as different categories. Standardizing capitalization ensures uniformity.
Python strings have a lower() method, exposed in pandas through the .str.lower() accessor, which lowercases the text as follows:
# Convert all strings to lowercase
df['Department'] = df['Department'].str.lower()
print("After Case Standardization:\n", df['Department'])
Another way to standardize case is to capitalize only the first letter. Python strings have a built-in capitalize() method, available in pandas as .str.capitalize():
# Capitalize all strings
df['Department'] = df['Department'].str.capitalize()
print("After Case Standardization:\n", df['Department'])
In Excel, use the LOWER function to convert text to lowercase. The syntax is =LOWER(A1); the function returns the content of cell A1 in all lowercase letters. Similarly, you can capitalize text with the PROPER function, which converts text so that each word starts with an uppercase letter and all other letters are lowercase. Its syntax is =PROPER(A1).
You can apply these formulas to any cell with text that needs to be formatted, and Excel will automatically update it accordingly.
Whitespace is a common but often overlooked issue in datasets, potentially causing problems with string matching and grouping.
In Python, you can use the strip() method (via pandas’ .str.strip()) to remove leading and trailing spaces as follows:
print("Before Removing Whitespace:\n", df['Name'])# Trim whitespacedf['Name'] = df['Name'].str.strip()print("After Removing Whitespace:\n", df['Name'])
If you also want to handle extra whitespace between words, you can use pandas’ .str.replace() method with a regular expression. Here’s how you can do it:
# Collapse runs of whitespace between words to a single space, then trim the ends
df['Name'] = df['Name'].str.replace(r'\s+', ' ', regex=True).str.strip()
In Excel, use the TRIM function to remove unnecessary spaces. The syntax is =TRIM(A2). This function removes all leading and trailing spaces and collapses extra spaces between words, leaving only a single space between them.
Variations in storing categories can be common when working with categorical data. For example, “dog” and “DOG” represent the same category but are treated separately due to case differences. Similarly, typos such as “dof” instead of “dog” can create inconsistencies in your dataset. Cleaning your data ensures that these categories are standardized and merged appropriately.
In Python, you can handle both case differences and typos in categorical data. After ensuring all the text is in a standard format (i.e., lowercase), use a mapping to replace incorrect values with the correct category.
Here’s how:
print("Before Correcting Typos:\n", df['Department'])# Skipping lowercasing since we've already covered it# Define a mapping to correct typosdepartment_map = {'hr': 'human resources', 'sales': 'sales', 'dvelopmnt': 'development'}# Replace incorrect values using the mappingdf['Department'] = df['Department'].map(department_map)print("After Correcting Typos:\n", df['Department'])
After performing this step, all variations of “hr” will be merged under one standard value: “human resources.” Similarly, the typo “dvelopmnt” will be corrected to “development”.
In Excel, you can handle duplicate categories by standardizing the case and correcting typos:
Standardize case: Use the LOWER function to convert all text to lowercase. Example: =LOWER(A1).
Correct typos: Use the IF function to replace incorrect entries. Example: =IF(A1="dvelopmnt", "development", A1).
If your dataset contains more complex typos, you can use Excel’s “Find” and “Replace” features to quickly locate and correct them.
Another approach is creating a separate mapping table listing common misspellings alongside the correct values. You can then use functions like VLOOKUP or XLOOKUP to replace the incorrect values with the correct ones. This method is especially helpful when you have consistent typos across the dataset or need to standardize values (e.g., replacing “Appl” and “Apl” with “Apple”).
Lastly, you can use Excel’s Spelling and Grammar Check (under the “Review” tab) to catch potential errors. Remember that this feature works best for dictionary words, so it may not always catch typos in names, codes, or specialized terms.
Once you’ve handled duplicates and typos, you can remove any remaining duplicates using the “Remove Duplicates” feature under the “Data” tab to ensure consistency.
Sometimes you may have irrelevant data that should be removed. Let’s say you want to predict the sales of a magazine. You’re examining a dataset of magazines ordered from Amazon over the past year, and you notice a feature variable called “font-type” that indicates which font was used in the magazine. This feature is irrelevant and unlikely to help predict a magazine’s sales.
Our dataset has an irrelevant feature, Hobby, that does not add value to the analysis. This feature can be removed as follows:
# Drop the irrelevant Hobby column
df.drop('Hobby', axis=1, inplace=True)
print(df)
Likewise, in Excel, if you have an irrelevant column that doesn’t add value to your analysis, you can easily delete it by right-clicking the respective column header and choosing “Delete” from the context menu.
Removing irrelevant features like this makes data exploration easier and helps train your machine-learning model more efficiently.
Dirty data includes any points that are wrong or shouldn’t be there. Duplicates occur when data points are repeated in your dataset.
Duplicate data points can lead to inaccurate analytics at a later stage. For example, in an online store’s sales report, duplicates might show that a product sold more than it did. Removing duplicates ensures the data is accurate and decisions are better informed.
To handle dirty data, you can either drop problematic data points or use a replacement (like converting incorrect data points into the correct ones).
To remove duplicate rows in Python using pandas, you can use the drop_duplicates() function. This will eliminate all rows that have the same values across all columns. Here’s how to do it:
# Drop duplicate rows
df.drop_duplicates(inplace=True)
print("After Removing Duplicate Data:\n", df)
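If you want to see what will be removed before dropping anything, or to treat rows as duplicates based on only a few columns, the duplicated() method and the subset parameter help; the Name and Age columns below are just illustrative choices:

# Count and inspect duplicate rows before removing them
print("Number of duplicate rows:", df.duplicated().sum())
print(df[df.duplicated(keep=False)])  # Show every row involved in a duplicate group

# Treat rows as duplicates when they match on selected columns only
df = df.drop_duplicates(subset=['Name', 'Age'], keep='first')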
In Excel, you can use the “Remove Duplicates” feature under the “Data” tab to ensure your dataset is clean. This feature lets you quickly identify and remove identical rows across one or more columns. Here’s how you can use it:
Select the range of cells or the entire dataset where you want to remove duplicates.
Go to the “Data” tab on the Ribbon.
Click “Remove Duplicates” in the “Data Tools” group.
Choose which columns to check for duplicates or select all columns.
Click “OK,” and Excel will remove the duplicates, leaving unique rows.
This method is simple and effective for cleaning up your data without formulas.
The data cleaning process sounds time-consuming, but it makes your data easier to work with and allows you to get the most out of it. Having clean data ensures you’re working with high-quality data. Some benefits of data cleaning include:
Reduced errors: Cleaning helps you spot and eliminate incorrect, corrupt, or inconsistent data, so you make fewer errors overall.
Efficient analysis: Besides Python and Excel, data cleaning tools, such as DemandTools or Oracle Enterprise Data Quality, help increase efficiency and speed up the decision-making process.
Improved decision-making: With thorough and mindful data cleaning, you’ll gain better and more actionable insights from the data.
In summary, data cleaning is a critical step in ensuring the accuracy and reliability of any analysis. By addressing issues like missing values, duplicates, and inconsistencies, you can extract meaningful insights that lead to well-informed decisions. Neglecting this process risks making decisions based on flawed or incomplete data, which can have significant consequences.
A clean dataset paves the way for more robust models and actionable outcomes. Now that you’ve learned about this process in depth, you’re ready to learn more advanced concepts within data science.