Pandas DataFrame Operations - Dealing With Missing Values and Duplicates
8. Dealing With Missing Values
The difference between fake data and real-world data is that real data is rarely clean and homogeneous. One particular issue that we need to tackle when working with real data is missing values. The problem is not only that values are missing: different data sources can also indicate missing values in different ways.
The two flavors in which we are likely to encounter missing or null values are:
- None: A Python object that is often used to mark missing data in Python code. Because it is a Python object, None can only be used in arrays with data type ‘object’ (i.e., arrays of Python objects).
- NaN (Not a Number): A special floating-point value that is used to represent missing data. Because NaN is a floating-point value, an array containing it keeps a native floating-point type, so, unlike with None’s object array, we can perform mathematical operations on it. However, remember that the result of arithmetic with NaN (addition, subtraction, multiplication, or division) will be another NaN.
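As a minimal sketch of how NaN propagates, the snippet below uses a small made-up array (the name `values` and its contents are illustrative, not from the original text). It also shows `np.nansum`, NumPy's NaN-aware variant of `sum`, as one possible way to skip the missing entry:

```python
import numpy as np

# Hypothetical sample data with one missing entry
values = np.array([1.0, np.nan, 3.0])

total = values.sum()             # the single NaN makes the whole sum NaN
shifted = values + 10            # element-wise: only the NaN slot stays NaN
safe_total = np.nansum(values)   # NaN-aware sum ignores the missing entry

print("sum:", total)
print("shifted:", shifted)
print("nansum:", safe_total)
```

Note that the array keeps a floating-point dtype throughout, which is why these operations run without raising an error.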
Run the examples below to understand the difference between the two. Observe that performing arithmetic operations on the array containing None throws a run-time error (a TypeError), while the same code executes without errors for NaN and simply returns NaN:
import numpy as np
import pandas as pd

# Example with None
None_example = np.array([0, None, 2, 3])
print("dtype =", None_example.dtype)
print(None_example)

# Example with NaN
NaN_example = np.array([0, np.nan, 2, 3])
print("dtype =", NaN_example.dtype)
print(NaN_example)

# Math operations fail with None but give NaN as output with NaNs
print("Arithmetic Operations")
print("Sum with NaNs:", NaN_example.sum())
print("Sum with None:", None_example.sum())
Pandas is built to handle both NaN and None, and it treats the two as essentially interchangeable for indicating missing or null values. Pandas also provides us with many useful methods for detecting, removing, and replacing null values in ...
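The interchangeability described above can be sketched with a small Series. The data and variable names here are made up for illustration; `isnull`, `dropna`, and `fillna` are standard pandas methods for detecting, removing, and replacing null values, and filling with 0 is just one possible choice:

```python
import numpy as np
import pandas as pd

# Hypothetical Series mixing both kinds of missing marker
s = pd.Series([1, np.nan, 2, None])

print(s.dtype)        # None is converted to NaN, so the Series is float64

mask = s.isnull()     # detect: True where a value is missing
cleaned = s.dropna()  # remove: drop the missing entries
filled = s.fillna(0)  # replace: substitute a value of our choosing

print(mask.tolist())
print(cleaned.tolist())
print(filled.tolist())
```

Both the explicit `np.nan` and the `None` show up as missing in `mask`, which is the sense in which Pandas treats the two markers alike.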