...
Dummy Estimators and Handling Imbalance Class Problem
In this lesson, you will learn about dummy estimators and how to handle the imbalanced class problem. Dummy estimators help develop baseline models for classification. Class imbalance is a common problem, and there are several techniques to deal with it.
Dummy Estimators
Dummy estimators help us define a baseline model for the problem at hand. We saw them for regression problems too. For classification, DummyClassifier supports the following strategies.
- stratified: It predicts a random class label, respecting the class distribution of the training set.
- most_frequent: It always predicts the most common label in the training dataset.
- prior: It predicts the class that maximizes the class prior.
- uniform: It generates predictions uniformly at random.
- constant: It always predicts a constant label provided by the user.
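To make the strategies concrete, here is a minimal pure-Python sketch of what each one does. It is not scikit-learn's implementation: the helper `dummy_predict`, its parameters, and the toy labels are hypothetical names introduced only for illustration.

```python
import random
from collections import Counter


def dummy_predict(y_train, n_samples, strategy, constant=None, seed=0):
    """Return n_samples baseline predictions that ignore the input features.

    Hypothetical helper for illustration only, not part of scikit-learn.
    """
    rng = random.Random(seed)
    counts = Counter(y_train)
    classes = sorted(counts)

    if strategy in ("most_frequent", "prior"):
        # Both strategies predict the single most common training label.
        label = max(classes, key=lambda c: counts[c])
        return [label] * n_samples
    if strategy == "stratified":
        # Draw labels at random, weighted by their training frequency.
        weights = [counts[c] for c in classes]
        return rng.choices(classes, weights=weights, k=n_samples)
    if strategy == "uniform":
        # Draw labels at random, with every class equally likely.
        return rng.choices(classes, k=n_samples)
    if strategy == "constant":
        # Always return the label supplied by the user.
        return [constant] * n_samples
    raise ValueError(f"unknown strategy: {strategy!r}")


# Toy imbalanced training labels: 90% "neg", 10% "pos".
y_train = ["neg"] * 90 + ["pos"] * 10
for strategy in ("most_frequent", "prior", "stratified", "uniform"):
    print(strategy, Counter(dummy_predict(y_train, 1000, strategy)))
print("constant", Counter(dummy_predict(y_train, 1000, "constant", constant="pos")))
```

On imbalanced labels like these, the printed counts show how differently the baselines behave: most_frequent never predicts the minority class, while uniform predicts it roughly half of the time.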
prior always predicts the class that maximizes the class prior (like most_frequent), and predict_proba returns the class prior.
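The difference between most_frequent and prior only shows up in the probability estimates. The short sketch below illustrates this on a toy dataset; the arrays `X` and `y` are made up for the example, and the features are all zeros because dummy estimators ignore them.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy data (assumed for illustration): 3 of the 4 labels are class 0.
X = np.zeros((4, 1))
y = np.array([0, 0, 0, 1])

most_frequent = DummyClassifier(strategy="most_frequent").fit(X, y)
prior = DummyClassifier(strategy="prior").fit(X, y)

# Both strategies make the same hard predictions.
pred_most_frequent = most_frequent.predict(X[:2])
pred_prior = prior.predict(X[:2])
print(pred_most_frequent, pred_prior)

# Their probability estimates differ.
proba_most_frequent = most_frequent.predict_proba(X[:1])
proba_prior = prior.predict_proba(X[:1])
print(proba_most_frequent)
print(proba_prior)
```

With this toy data, prior should return the class frequencies (0.75 and 0.25), while most_frequent puts all of the probability on class 0. This matters when the baseline is scored with a probability-based metric such as log loss.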
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, train_size=0.7)

# Fitting the BaseLine DummyEstimator
from sklearn.dummy import DummyClassifier
clf = DummyClassifier(strategy='most_frequent', random_state=0)
clf.fit(X_train, y_train)
print("The accuracy (DummyClassifier) on test set is {0:.2f}".format(clf.score(X_test, y_test)))

# Fitting the Support Vector Machine
from sklearn.svm import SVC
clf = SVC(kernel='linear', C=1).fit(X_train, y_train)
print("The accuracy (SVM) on test set is {0:.2f}".format(clf.score(X_test, y_test)))
- On Lines 1-2, we load the necessary modules. On Line 3, we load the Iris dataset. On Line 4, we divide the Iris dataset into training and test datasets. Note that train_size=0.7 places 70% of the rows in the training dataset and the remaining 30% in the test dataset.
- On Line 9
...