Missing Data

Learn how XGBoost handles missing data.

XGBoost’s approach to handling missing data

As a final note on the use of both XGBoost and SHAP, one valuable trait of both packages is their ability to handle missing values. Recall that in the chapter “Data Exploration and Cleaning,” we found that some samples in the case study data had missing values for the PAY_1 feature. So far, our approach has been simply to remove these samples before building models, because the machine learning models implemented by scikit-learn cannot work with the data unless the missing values are addressed in some way. Dropping the affected samples is one option, but it throws data away. That may be acceptable when only a very small fraction of the data is affected; in general, though, it’s worth knowing how to deal with missing values properly.

There are several approaches to imputing missing feature values, such as filling them in with the mean or mode of the non-missing values of that feature, or with a value randomly sampled from the non-missing values. You can also build a model that treats the feature in question as the response variable, with all the other features as predictors, and use this model to predict the missing values. However, because XGBoost typically performs at least as well as other machine learning models on binary classification tasks with tabular data like ours, and handles missing values natively, we’ll forgo a more in-depth exploration of imputation and let XGBoost do the work for us.
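As a quick illustration of the simplest option mentioned above, here is a minimal NumPy sketch of mean imputation. The array `pay_1` and its values are made up for the example; NaN stands in for a missing value:

```python
import numpy as np

# Hypothetical feature column with missing values encoded as NaN
pay_1 = np.array([0.0, 2.0, np.nan, 1.0, np.nan, 3.0])

# Mean imputation: replace each NaN with the mean of the non-missing values
mean_value = np.nanmean(pay_1)  # mean over non-missing entries only
imputed = np.where(np.isnan(pay_1), mean_value, pay_1)

print(imputed)  # NaNs replaced by 1.5, the mean of [0, 2, 1, 3]
```

Mode imputation or random sampling from the non-missing values follows the same pattern, just with a different fill value.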

How does XGBoost handle missing data? At every opportunity to split a node, XGBoost considers only the non-missing values of the candidate feature when evaluating split thresholds. If a feature with missing values is chosen to make a split, the samples missing that feature are all sent down whichever branch minimizes the loss function; this learned “default direction” is stored with the split and applied again at prediction time.
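The idea can be sketched in plain Python. This is only an illustration of the principle, not XGBoost’s actual implementation (which uses gradient statistics rather than the squared-error loss used here): for a candidate split, compute the loss with the missing-value samples sent to the left child versus the right child, and keep whichever direction is cheaper.

```python
# Illustrative sketch (not XGBoost's real code): choose the "default
# direction" for missing values at a split by trying both children and
# keeping whichever assignment gives the lower squared-error loss.

def sse(ys):
    """Sum of squared errors around the mean (0.0 for an empty node)."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def split_loss(samples, threshold, missing_to_left):
    """Loss of splitting at threshold, with missing values sent one way."""
    left, right = [], []
    for x, y in samples:
        if x is None:  # missing feature value
            (left if missing_to_left else right).append(y)
        elif x < threshold:
            left.append(y)
        else:
            right.append(y)
    return sse(left) + sse(right)

# Made-up (feature value, target) pairs; None marks a missing feature value
samples = [(1.0, 0.0), (2.0, 0.1), (8.0, 1.0), (9.0, 0.9),
           (None, 1.1), (None, 0.95)]

loss_left = split_loss(samples, threshold=5.0, missing_to_left=True)
loss_right = split_loss(samples, threshold=5.0, missing_to_left=False)
default_direction = 'left' if loss_left < loss_right else 'right'
print(default_direction)  # prints "right"
```

Here the missing-value samples have targets resembling the right child’s, so sending them right yields the lower loss, and “right” becomes the default direction for this split.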

Saving Python variables to a file

In the challenge for this section, to write to and read from files we’ll use a new Python statement, with, and the pickle package. A with statement makes it easier to work with files because it both opens and closes the file automatically, instead of the user needing to do this separately. You can use a code snippet like this to save variables to a file:

import pickle

with open('filename.pkl', 'wb') as f:
    pickle.dump([var_1, var_2], f)

where filename.pkl is your chosen file path, 'wb' indicates the file is opened for writing in binary format, and pickle.dump saves the list of variables var_1 and var_2 to the file. To open this file and load these variables, possibly in a separate Jupyter notebook, the code is similar, but the file now needs to be opened for reading in binary format ('rb'):

with open('filename.pkl', 'rb') as f:
    var_1, var_2 = pickle.load(f)
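Putting the two snippets together, a full round trip looks like the following sketch. The variables here are placeholders for whatever you want to persist, and a temporary directory stands in for your chosen file path:

```python
import os
import pickle
import tempfile

# Example variables to persist (stand-ins for, e.g., model parameters and data)
var_1 = {'learning_rate': 0.1, 'max_depth': 3}
var_2 = [1, 2, 3]

path = os.path.join(tempfile.gettempdir(), 'filename.pkl')

# Write: 'wb' = write binary; the with statement closes the file for us
with open(path, 'wb') as f:
    pickle.dump([var_1, var_2], f)

# Read back: 'rb' = read binary; unpack the list into two names
with open(path, 'rb') as f:
    var_1_loaded, var_2_loaded = pickle.load(f)

print(var_1_loaded == var_1 and var_2_loaded == var_2)  # prints True
```

Note that pickle can serialize most Python objects, including trained model objects, which is what makes it convenient for carrying work between notebooks.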
