Want hands-on practice with data cleaning and visualization? Try the Time Series Analysis and Visualization Using Python and Plotly project.
When working with data, your insights and analysis are only as good as the data you use. Dirty data leads to unreliable conclusions, making it impossible for organizations to make efficient and effective decisions.
That’s where data cleaning comes in. This critical part of data management ensures your datasets are accurate, consistent, and ready for analysis. It’s about more than fixing typos or syntax errors—data cleaning is foundational to data science and machine learning, playing a key role in delivering trustworthy results.
In this blog, we’ll explore:
Why data cleaning is essential
Common challenges in messy datasets
Step-by-step techniques for cleaning data with Python and Excel
Let’s dive into the details of this indispensable skill and see how it can unlock the full potential of your analysis.
Data cleaning, or data cleansing, is the important process of correcting or removing incorrect, incomplete, or duplicate data within a dataset. Data cleaning should be the first step in your workflow. When working with large datasets and combining various data sources, there’s a high likelihood of duplicating or mislabeling data. Inaccurate or low-quality data undermines reliability, making algorithms and outcomes untrustworthy.
Data cleaning differs from data transformation because, with data cleaning, you’re correcting or removing data that doesn’t belong in your dataset. On the other hand, with data transformation, you change your data to a different format or structure to prepare it for analysis. Data transformation processes are sometimes referred to as data wrangling or data munging.
Today, we’ll focus on data cleaning. To determine data quality, you can study its features and weigh them according to what’s important to your organization and project.
Always look for the following five main features when evaluating your data:
Consistency: Is your data consistent across datasets?
Accuracy: Is your data close to the true values?
Completeness: Does your data include all required information?
Validity: Does your data correspond with business rules and/or restrictions?
Uniformity: Is your data specified using consistent units of measurement?
Now that we’ve learned how to recognize high-quality data, let’s explore the data cleaning process, why it’s important, and how to do it effectively.
Let’s discuss some cleaning steps to ensure you work with high-quality data. Data scientists spend a lot of time cleaning data because it’s much easier to analyze and build models once their data is clean. This section will explore common data issues, why they occur, and practical steps in Python and Microsoft Excel to resolve them.
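Before fixing anything, it helps to profile the dataset so you know which of these issues you are dealing with. Here is a minimal sketch of a first-pass check; data.csv is a placeholder file name, and the example dataset used in the rest of this blog is built in the next code snippet:

import pandas as pd

# Load your data; 'data.csv' is a placeholder file name
df = pd.read_csv('data.csv')

# Quick profile: shape, types, missing values, duplicates, and summary statistics
print(df.shape)
print(df.dtypes)
print(df.isna().sum())        # Missing values per column
print(df.duplicated().sum())  # Fully duplicated rows
print(df.describe())          # Spot suspicious ranges and outliers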
It’s common for large datasets to have some missing values. Maybe the person recording the data forgot to input them, or maybe they began collecting those missing data variables late into the data collection process. No matter what, missing data should be managed before working with datasets. Missing data can be managed by:
Removing rows or columns with excessive missing values.
Imputing values using statistical methods like the mean, median, or mode.
The following Python code builds the example dataset used throughout this blog and then replaces the NaN values in the Age and Salary columns with each column’s mean to handle missing data.
import pandas as pd
import numpy as np

# Initial dataset
data = {
    'Name': [' Alice ', 'bob', 'CHARLIE', 'Alice ', 'dave '],
    'Age': [25, 30, np.nan, 25, 35],
    'Salary': [50000, 70000, np.nan, 50000, 1000000],  # 1M is an outlier
    'Department': ['hr', 'sales', 'SALES', 'dvelopmnt', 'HR'],  # Typos and inconsistent casing
    'PerformanceScore': [85, 92, 40, 88, 300],  # Outlier in performance score
    'Hobby': ['Reading', 'Traveling', 'Gaming', 'Cooking', 'Hiking'],
    'JoinDate': ['2022-01-15', '2020-06-01', '2021-07-01', '2021-03-20', '2019-12-01']
}
df = pd.DataFrame(data)
print("Initial Dataset:\n", df)

# Replace missing values with the column mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
print("After Handling Missing Data:\n", df)
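The mean is only one way to impute. If you instead want to drop sparse rows or columns, or impute with the median or mode, here is a minimal sketch; the 50% threshold is an arbitrary choice for illustration, not part of the original example:

# Keep only rows that have at least half of their values present
df_rows_dropped = df.dropna(thresh=int(df.shape[1] * 0.5))

# Keep only columns that have at least half of their values present
df_cols_dropped = df.dropna(axis=1, thresh=int(df.shape[0] * 0.5))

# Median or mode imputation instead of the mean, e.g.:
# df['Age'] = df['Age'].fillna(df['Age'].median())
# df['Department'] = df['Department'].fillna(df['Department'].mode()[0])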
In Excel, use =IF(ISNA(A2), AVERAGE(A:A), A2) to replace an #N/A value in cell A2 with the average of column A; if your missing values appear as blank cells instead of #N/A errors, swap ISNA for ISBLANK. The IF function, written as =IF(condition, value1, value2), evaluates a condition and returns one of two values: value1 if the condition is true and value2 if it is false.
Outliers can hold essential information about your data, but they can also skew your analysis. It’s a good idea to examine your data both with and without outliers. If you decide to keep them, choose a robust method that can handle them; if not, you can simply drop them.
In Python, you can filter out unwanted outliers by using the following code as a template:
# Define outlier thresholds
upper_salary_limit = np.percentile(df['Salary'], 98)
upper_perf_limit = np.percentile(df['PerformanceScore'], 98)

# Cap Salary and PerformanceScore outliers
df['Salary'] = np.where(df['Salary'] > upper_salary_limit, upper_salary_limit, df['Salary'])
df['PerformanceScore'] = np.where(df['PerformanceScore'] > upper_perf_limit, upper_perf_limit, df['PerformanceScore'])
print("After Filtering Outliers:\n", df)
In the code above, the outliers in the Salary and PerformanceScore columns of the df DataFrame are capped at the 98th-percentile thresholds (upper_salary_limit and upper_perf_limit) calculated from the data: any value above a column’s threshold is replaced with that threshold. The np.where function returns elements based on a condition, allowing you to efficiently replace or modify values in an array or DataFrame.
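Capping is not the only option. If you prefer to drop outliers instead, a common rule of thumb is the interquartile range (IQR) fence; the following is a minimal sketch applied to the Salary column, not part of the original example:

# Drop rows whose Salary falls outside the 1.5 * IQR fences
q1 = df['Salary'].quantile(0.25)
q3 = df['Salary'].quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
df_no_outliers = df[(df['Salary'] >= lower_fence) & (df['Salary'] <= upper_fence)]
print("After Dropping Salary Outliers:\n", df_no_outliers)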
In Excel, you can use formulas to identify and cap outliers:
Calculate the percentile thresholds: Use the PERCENTILE function to calculate the 98th and 2nd percentiles of the data range. For example:
=PERCENTILE(A1:A100, 0.98) finds the 98th percentile of the range A1:A100.
=PERCENTILE(A1:A100, 0.02) finds the 2nd percentile of the range A1:A100.
Cap the outliers: Use the IF function to replace values above or below these thresholds with the threshold values.
For example, this formula returns the capped value for cell A1, based on the 98th and 2nd percentile thresholds stored in cells B1 and C1, and is entered in a separate column:
=IF(A1 > $B$1, $B$1, IF(A1 < $C$1, $C$1, A1))
Here:
A1 is the cell containing the value to check.
$B$1 is the cell containing the 98th percentile threshold.
$C$1 is the cell containing the 2nd percentile threshold.
If A1 exceeds the 98th percentile, it is capped at $B$1. If A1 is less than the 2nd percentile, it is capped at $C$1. Otherwise, it retains its original value.
Tip: Because we’ve used A1 (without dollar signs) instead of $A$1 in the formula, the reference is relative. If you copy or drag the formula to another cell, Excel automatically updates the cell reference to match the new location. For example, you can paste this formula in cell D1 and drag it down to apply it to the entire column.
Another common issue is values stored with the wrong data type, such as dates kept as plain text. In Python, you can convert the JoinDate column to a proper datetime type, coercing any invalid entries to NaT:

# Convert JoinDate to datetime; errors='coerce' turns unparseable dates into NaT
df['JoinDate'] = pd.to_datetime(df['JoinDate'], errors='coerce')
print("After Fixing Conversion Errors:\n", df['JoinDate'])
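Dates aren’t the only type that ends up stored as text. For numbers kept as strings, pd.to_numeric behaves the same way; here is a minimal sketch using a hypothetical Revenue column that is not part of the example dataset:

# Hypothetical example: numbers stored as text, with one unparseable entry
raw = pd.DataFrame({'Revenue': ['1200', '980', '123abc', '1500']})

# errors='coerce' converts unparseable strings to NaN instead of raising an error
raw['Revenue'] = pd.to_numeric(raw['Revenue'], errors='coerce')
print(raw)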
In Excel, use =VALUE(A2) to convert text that represents a number into an actual numeric value. However, the function will return an error if the text contains non-numeric characters (e.g., “123abc”).
The data in your feature variables should be standardized. It makes examining and modeling your data a lot easier. Inconsistent data values can make your dataset messy and unreliable, leading to errors during analysis or modeling. Variations in capitalization, typos, or inconsistent formatting can cause similar entries to be treated as entirely different categories. Let’s explore common data standardization challenges and the techniques to address them.
Data values often vary in capitalization, leading to inconsistencies. For instance, “DOG” and “dog” might be treated as different categories. Standardizing capitalization ensures uniformity.
Python strings have a lower() method, exposed in pandas through the .str.lower() accessor, which lowercases the text as follows:
# Convert all strings to lowercase
df['Department'] = df['Department'].str.lower()
print("After Case Standardization:\n", df['Department'])
Another way to standardize case is to capitalize only the first letter. Python strings have a built-in capitalize() method, available in pandas as .str.capitalize():
# Capitalize all strings
df['Department'] = df['Department'].str.capitalize()
print("After Case Standardization:\n", df['Department'])
In Excel, use the LOWER function to convert text to lowercase. The syntax is =LOWER(A1); the function returns the content of cell A1 in all lowercase letters. Similarly, you can capitalize text with the PROPER function, which converts text so that each word starts with an uppercase letter and all other letters are lowercase. Its syntax is =PROPER(A1).
You can apply these formulas to any cell with text that needs to be formatted, and Excel will automatically update it accordingly.
Whitespace is a common but often overlooked issue in datasets, potentially causing problems with string matching and grouping.
In Python, you can use the strip() method (via pandas’ .str.strip()) to remove leading and trailing spaces as follows:
print("Before Removing Whitespace:\n", df['Name'])# Trim whitespacedf['Name'] = df['Name'].str.strip()print("After Removing Whitespace:\n", df['Name'])
If you also want to handle extra whitespace between words, you can use pandas’ .str.replace() method with a regular expression. Here’s how you can do it:
# Collapse runs of whitespace between words to a single space, then trim the ends
df['Name'] = df['Name'].str.replace(r'\s+', ' ', regex=True).str.strip()
In Excel, use the TRIM function to remove unnecessary spaces. The syntax is =TRIM(A2). This function removes all leading and trailing spaces and collapses extra spaces between words, leaving only a single space between them.
Variations in storing categories can be common when working with categorical data. For example, “dog” and “DOG” represent the same category but are treated separately due to case differences. Similarly, typos such as “dof” instead of “dog” can create inconsistencies in your dataset. Cleaning your data ensures that these categories are standardized and merged appropriately.
In Python, you can handle both case differences and typos in categorical data. After ensuring all the text is in a standard format (i.e., lowercase), use a mapping to replace incorrect values with the correct category.
Here’s how:
print("Before Correcting Typos:\n", df['Department'])# Skipping lowercasing since we've already covered it# Define a mapping to correct typosdepartment_map = {'hr': 'human resources', 'sales': 'sales', 'dvelopmnt': 'development'}# Replace incorrect values using the mappingdf['Department'] = df['Department'].map(department_map)print("After Correcting Typos:\n", df['Department'])
After performing this step, all variations of “hr” will be merged under one standard value: “human resources.” Similarly, the typo “dvelopmnt” will be corrected to “development”.
In Excel, you can handle duplicate categories by standardizing the case and correcting typos:
Standardize case: Use the LOWER function to convert all text to lowercase. Example: =LOWER(A1).
Correct typos: Use the IF function to replace incorrect entries. Example: =IF(A1="dvelopmnt", "development", A1).
If your dataset contains more complex typos, you can use Excel’s “Find” and “Replace” features to quickly locate and correct them.
Another approach is creating a separate mapping table listing common misspellings alongside the correct values. You can then use functions like VLOOKUP or XLOOKUP to replace the incorrect values with the correct ones. This method is especially helpful when you have consistent typos across the dataset or need to standardize values (e.g., replacing “Appl” and “Apl” with “Apple”).
Lastly, you can use Excel’s Spelling and Grammar Check (under the “Review” tab) to catch potential errors. Remember that this feature works best for dictionary words, so it may not always catch typos in names, codes, or specialized terms.
Once you’ve handled duplicates and typos, you can remove any remaining duplicates using the “Remove Duplicates” feature under the “Data” tab to ensure consistency.
Sometimes you may have irrelevant data that should be removed. Let’s say you want to predict the sales of a magazine. You’re examining a dataset of magazines ordered from Amazon over the past year, and you notice a feature variable called “font-type” that indicates which font was used in the magazine. This feature is irrelevant and unlikely to help predict a magazine’s sales.
Our dataset has an irrelevant feature, Hobby, that does not add value to the analysis. This feature can be removed as follows:
# Drop the irrelevant Hobby column
df.drop('Hobby', axis=1, inplace=True)
print(df)
Likewise, in Excel, if you have an irrelevant column that doesn’t add value to your analysis, you can easily delete it by right-clicking the respective column header and choosing “Delete” from the context menu.
Removing irrelevant features like this makes data exploration easier and helps train your machine-learning model more efficiently.
Dirty data includes any points that are wrong or shouldn’t be there. Duplicates occur when data points are repeated in your dataset.
Duplicate data points can lead to inaccurate analytics at a later stage. For example, in an online store’s sales report, duplicates might show that a product sold more than it did. Removing duplicates ensures the data is accurate and decisions are better informed.
To handle dirty data, you can either drop problematic data points or use a replacement (like converting incorrect data points into the correct ones).
To remove duplicate rows in Python using pandas, you can use the drop_duplicates() function. This will eliminate all rows that have the same values across all columns. Here’s how to do it:
# Drop duplicate rows
df.drop_duplicates(inplace=True)
print("After Removing Duplicate Data:\n", df)
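If you want to see what will be removed before dropping anything, or to treat rows as duplicates based on only a few columns, the duplicated() method and the subset parameter help; the Name and Age columns below are just illustrative choices:

# Count and inspect duplicate rows before removing them
print("Number of duplicate rows:", df.duplicated().sum())
print(df[df.duplicated(keep=False)])  # Show every row involved in a duplicate group

# Treat rows as duplicates when they match on selected columns only
df = df.drop_duplicates(subset=['Name', 'Age'], keep='first')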
In Excel, you can use the “Remove Duplicates” feature under the “Data” tab to ensure your dataset is clean. This feature lets you quickly identify and remove identical rows across one or more columns. Here’s how you can use it:
Select the range of cells or the entire dataset where you want to remove duplicates.
Go to the “Data” tab on the Ribbon.
Click “Remove Duplicates” in the “Data Tools” group.
Choose which columns to check for duplicates or select all columns.
Click “OK,” and Excel will remove the duplicates, leaving unique rows.
This method is simple and effective for cleaning up your data without formulas.
The data cleaning process sounds time-consuming, but it makes your data easier to work with and allows you to get the most out of it. Having clean data ensures you’re working with high-quality data. Some benefits of data cleaning include:
Reduced errors: Cleaning helps you spot and eliminate incorrect, corrupt, or inconsistent data, so you make fewer errors overall.
Efficient analysis: Besides Python and Excel, data cleaning tools, such as DemandTools or Oracle Enterprise Data Quality, help increase efficiency and speed up the decision-making process.
Improved decision-making: With thorough and mindful data cleaning, you’ll gain better and more actionable insights from the data.
In summary, data cleaning is a critical step in ensuring the accuracy and reliability of any analysis. By addressing issues like missing values, duplicates, and inconsistencies, you can extract meaningful insights that lead to well-informed decisions. Neglecting this process risks making decisions based on flawed or incomplete data, which can have significant consequences.
A clean dataset paves the way for more robust models and actionable outcomes. Now that you’ve learned about this process in depth, you’re ready to learn more advanced concepts within data science.