Solution: Retrain the Model Without Pclass Dummies
View the solution to the “Retrain the Model Without Pclass Dummies” exercise.
Let's practice what we have learned so far.
Task
The task is to retrain the model using the original Pclass column,
without converting it to dummy variables, and obtain the model results.
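To make the difference concrete, here is a minimal sketch of the two encodings in plain Python. The sample values are hypothetical and not taken from the dataset; option B mimics what dummy encoding with the first level dropped would produce for a three-class column.

```python
# Hypothetical sample of passenger classes (not taken from the dataset).
pclass = [3, 1, 2, 3]

# Option A: keep the original column, one ordinal feature per passenger.
ordinal = [[p] for p in pclass]

# Option B: dummy-encode with the first level dropped,
# which yields two indicator columns: Pclass_2 and Pclass_3.
levels = sorted(set(pclass))[1:]  # [2, 3]
dummies = [[int(p == lvl) for lvl in levels] for p in pclass]

print(ordinal)  # [[3], [1], [2], [3]]
print(dummies)  # [[0, 1], [0, 0], [1, 0], [0, 1]]
```

With the original column, the model learns a single coefficient for Pclass and so assumes an evenly spaced, ordered effect across the classes. With dummies, each remaining class gets its own coefficient.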
Solution
The code below contains the solution for the task mentioned above.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# loading the train data
train = pd.read_csv("train_titanic_Xy.csv")

# pre-processing
def impute_age(age_pclass):
    # fill a missing Age with a class-specific default value
    Age = age_pclass["Age"]
    Pclass = age_pclass["Pclass"]
    if pd.isnull(Age):
        if Pclass == 1:
            return 38
        elif Pclass == 2:
            return 30
        else:
            return 25
    else:
        return Age

train["Age"] = train[["Age", "Pclass"]].apply(impute_age, axis=1)
train.drop("Cabin", axis=1, inplace=True)
train.dropna(inplace=True)

# creating dummies (Sex and Embarked only; Pclass stays as the original column)
sex_emb = pd.get_dummies(train[["Sex", "Embarked"]], drop_first=True)
train = pd.concat([train, sex_emb], axis=1)

# get rid of extraneous features
train.drop(["Sex", "Embarked", "Name", "Ticket", "PassengerId"], inplace=True, axis=1)

# training data
X_train = train.drop("Survived", axis=1)
y_train = train["Survived"]

# scaling the data
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_s = scaler.transform(X_train)

# model ==> training
logR = LogisticRegression(penalty="l2", max_iter=10000, C=0.5, n_jobs=-1)
logR.fit(X_train_s, y_train)

# loading the test data
X_test = pd.read_csv("test_titanic_X.csv")
y_test = pd.read_csv("test_titanic_y.csv")
test = pd.merge(X_test, y_test, on="PassengerId", how="left")

# pre-processing
test["Age"] = test[["Age", "Pclass"]].apply(impute_age, axis=1)
test.drop("Cabin", axis=1, inplace=True)
test.dropna(inplace=True)

# repeat the dummy encoding and feature removal on the test data
sex_emb_test = pd.get_dummies(test[["Sex", "Embarked"]], drop_first=True)
test = pd.concat([test, sex_emb_test], axis=1)
test.drop(["Sex", "Embarked", "Name", "Ticket", "PassengerId"], axis=1, inplace=True)

# testing data
X_test = test.drop("Survived", axis=1)
y_test = test["Survived"]

# scaling the data (reuse the scaler fitted on the training data)
X_test_s = scaler.transform(X_test)
X_test_s = pd.DataFrame(X_test_s, columns=X_test.columns)

# predictions
pred_train = logR.predict(X_train_s)
pred_test = logR.predict(X_test_s)

# display
print("Training score (mean accuracy): ", logR.score(X_train_s, y_train))
print("Test Score (mean accuracy): ", logR.score(X_test_s, y_test))
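Because Pclass is now passed through the scaler as a single numeric column, it may help to see what min-max scaling does to it. The sketch below reimplements the default (0, 1) min-max formula in plain Python for illustration. The helper `min_max_scale` is hypothetical and is not part of the solution or of scikit-learn.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# The three passenger classes, assuming all of them occur in the training data.
pclass = [1, 2, 3]
scaled = min_max_scale(pclass)

print(scaled)  # [0.0, 0.5, 1.0]
```

Under that assumption, the three classes become evenly spaced points on one feature, so the logistic regression fits a single weight for Pclass rather than one weight per class.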
...