...

Exercise: Exploring and Cleaning the Data

Learn to explore and clean the data for predictive modeling.

We'll cover the following...

Examining features and data quality
Try it yourself

Thus far, we have identified a data quality issue related to the metadata: we had been told that every sample from our dataset corresponded to a unique account ID, but found that this was not the case. We were able to use logical indexing and pandas to correct this issue. This was a fundamental data quality issue, having to do simply with what samples were present, based on the metadata. Aside from this, we are not really interested in the metadata column of account IDs: these will not help us develop a predictive model for credit default.

Examining features and data quality

Now, we are ready to start examining the values of the features and response variable, the data we will use to develop our predictive model. Perform the following steps to complete this exercise:

Load the results of the previous exercise and obtain the data type of the columns in the data by using the info() method as shown below:
```
import pandas as pd
df_clean_1 = pd.read_csv('df_clean_1.csv')
df_clean_1.info()
```
You should see the following output:

Press + to interact

Introduction

Data Exploration and Cleaning

(Challenge) Exploring Remaining Financial Features in Dataset

Introduction to scikit-learn and Model Evaluation

Fake News Detection Using Scikit-learn

(Challenge) Logistic Regression and Precision-Recall Curve

Details of Logistic Regression and Feature Extraction

(Challenge) Logistic Regression Model and Coefficients

The Bias-Variance Trade-Off

(Challenge) Cross-Validation and Feature Engineering

Decision Trees and Random Forests

(Challenge) Cross-Validation Grid Search with Random Forest

Gradient Boosting, XGBoost, and SHAP Values

(Challenge) XGBoost and SHAP Explanation for Case Study Data

Predict Frog Toxicity with Python and XGBoost

Test Set Analysis, Financial Insights, and Delivery to the Client

(Challenge) Deriving Financial Insights

Appendix

Exercise: Exploring and Cleaning the Data

Examining features and data quality