Model Evaluation
Create the model's classification report, confusion matrix, and receiver operating characteristic curve.
Evaluation is an essential step: we want our model to predict as accurately as possible. As we have learned, scikit-learn provides a convenient and efficient way to evaluate classification tasks with its classification_report function.
Classification report
Let's import this module and use it for evaluation.
from sklearn.metrics import classification_report

print("*****************************************************")
print("Report on training data:")
print(classification_report(y_train, pred_train))
print("*****************************************************")
print("*****************************************************\n")
print("Report on test data:")
print(classification_report(y_test, pred_test))
print("*****************************************************")
The classification report tells us the precision, recall, f1-score, and number of support cases for each class, along with their averages. Instead of the full report, however, we are often more interested in the confusion matrix, from which any specific metric can be calculated.
Confusion matrix
Let's get the confusion matrix using scikit-learn. We need to do another import.
from sklearn.metrics import confusion_matrix

# Pass y_test and the predictions to get the confusion matrix
print("Confusion Matrix from the test data:\n")
print(confusion_matrix(y_test, pred_test))
It is always nice to present results in a self-explanatory way. With a little extra code, the confusion matrix above can be displayed as a nicely labeled data frame.
import pandas as pd
from sklearn.metrics import confusion_matrix

# Label the rows (actual) and columns (predicted) for readability
df = pd.DataFrame(confusion_matrix(y_test, pred_test),
                  columns=["Predicted False", "Predicted True"],
                  index=["Actual False", "Actual True"])
print(df)
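To calculate a specific value from the matrix, the four cells of a binary confusion matrix can be unpacked directly. A minimal sketch with toy labels (illustrative only):

```python
from sklearn.metrics import confusion_matrix

# Toy labels, purely illustrative -- not the tutorial's actual data
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# For a binary problem, ravel() unpacks the matrix row by row:
# [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)

# Any specific metric can now be derived by hand
sensitivity = tp / (tp + fn)  # a.k.a. recall
specificity = tn / (tn + fp)
print("sensitivity:", sensitivity)
print("specificity:", specificity)
```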
Although it is a relatively simple model, logistic regression is widely used in real-world problems. Its coefficients are interpretable, so we can understand how the features X affect the target y. Another advantage is that, because of its strong simplifying assumptions, logistic regression usually does not suffer from high variance.
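To see this interpretability in practice, here is a minimal sketch. The tiny synthetic data is only there to produce a fitted model; the tutorial's own logR object would be inspected the same way:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny synthetic data just to produce a fitted model; the tutorial's
# logR object would be used in exactly the same way
X = np.array([[0.1], [0.4], [0.5], [0.9], [1.2], [1.5]])
y = np.array([0, 0, 0, 1, 1, 1])
logR = LogisticRegression().fit(X, y)

# Each coefficient is the change in the log-odds of class 1 per unit
# increase in that feature; exponentiating gives the odds ratio
print("coefficient (log-odds):", logR.coef_[0])
print("odds ratio:            ", np.exp(logR.coef_[0]))
```

A positive coefficient means the feature pushes predictions toward class 1; an odds ratio above 1 says the same thing on the odds scale.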
Predict the probabilities instead of class
Using the trained model (logR), we have predicted a class for each data point. However, each prediction carries a probability for every class. Often, especially when predicting disease, we also want to look at these class probabilities: they give us more control over the results, and we can even calibrate the decision threshold accordingly.
Recall from logistic regression theory that the default cut-off is a probability of 0.5 (for example, class 0 for a probability value between 0.0 and 0.49, and class 1 for a probability value ...