Solution: Model Evaluation
Follow the instructions to perform model evaluation on real-world data.
The model selection coding challenge has several valid solutions, depending on the cross-validation method we choose. Whichever method we pick, the important thing is to do the following:
Choose an appropriate metric for a classification task.
Use a cross-validation method to select the best model.
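To illustrate why the choice of metric matters, here is a minimal sketch comparing accuracy with the F1 score. The labels are hypothetical and deliberately imbalanced, which is an assumption for illustration and not a property of the challenge data. The "model" is a stand-in that always predicts the majority class.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 9 negatives and 1 positive (an imbalanced binary target)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
# A trivial stand-in "model" that always predicts the majority class
y_pred = [0] * 10

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 makes the score 0.0 (without a warning) when no positives are predicted
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"Accuracy: {accuracy:.2f}")
print(f"F1 score: {f1:.2f}")
```

Accuracy looks high here even though the positive class is never detected, while the F1 score exposes the problem. If the classes in a dataset are roughly balanced, accuracy may be an equally reasonable choice.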
Here is one possible solution:
main.py
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

preprocessed = pd.read_csv("preprocessed.csv")

# Define X (model features) and y (target variable).
# X_var (feature column names) and y_var (target column name) are assumed
# to be defined earlier, as part of the challenge setup.
X = preprocessed[X_var]
y = preprocessed[y_var]

# Three algorithms
classifiers = [
    LogisticRegression(penalty='l2', C=10),
    KNeighborsClassifier(n_neighbors=4, metric='euclidean', weights='distance'),
    DecisionTreeClassifier(max_depth=5, min_samples_split=10)
]

# Import evaluation metric
from sklearn.metrics import f1_score

# Initialize k-fold cross-validation
from sklearn.model_selection import KFold
k = 3
kf = KFold(n_splits=k)

# Perform k-fold cross-validation for each model
for model in classifiers:
    # Initialize a list to store the F1 scores for each fold
    f1_scores = []
    for train_index, test_index in kf.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        # Train the model on the training folds
        model.fit(X_train, y_train)

        # Calculate F1 score for the current fold
        y_test_pred = model.predict(X_test)
        f1_scores.append(f1_score(y_test, y_test_pred))

    # Report the mean F1 score across all folds for this model
    print(f"Average F1 Score for {type(model).__name__}:", np.mean(f1_scores))
The `classifiers` list: We initialize three different classification ...
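The manual fold loop in the solution can also be expressed with scikit-learn's `cross_val_score` helper. The sketch below is one possible variant, not the challenge's reference solution. It swaps in `StratifiedKFold` with shuffling, which differs from the plain `KFold` used above. It also runs on synthetic data from `make_classification`, because the contents of `preprocessed.csv` are not shown here. The `max_iter` and `random_state` values are likewise assumptions added for reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data standing in for the preprocessed dataset
# (an assumption for illustration only)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

classifiers = [
    # L2 regularization is the default penalty; max_iter is raised to help convergence
    LogisticRegression(C=10, max_iter=1000),
    KNeighborsClassifier(n_neighbors=4, metric="euclidean", weights="distance"),
    DecisionTreeClassifier(max_depth=5, min_samples_split=10, random_state=0),
]

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

mean_f1 = {}
for model in classifiers:
    # cross_val_score fits a fresh clone of the model on each training split
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    mean_f1[type(model).__name__] = scores.mean()
    print(f"Average F1 Score for {type(model).__name__}: {scores.mean():.3f}")

# Select the model with the highest mean F1 score across folds
best_model = max(mean_f1, key=mean_f1.get)
print("Best model by mean F1:", best_model)
```

With only three folds, the mean scores can be close to each other, so it may be worth inspecting the per-fold scores before treating the top-ranked model as clearly better.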