Advanced Cross-Validation

Learn more advanced methods of cross-validation.

Advanced cross-validation techniques, such as k-fold and leave-one-out, provide more robust and accurate assessments of model performance in ML. These methods go beyond the basic train-test split and allow for a more comprehensive evaluation of model generalization.

The k-fold cross-validation technique

The k-fold cross-validation technique involves dividing the original dataset into k equally sized subsets or folds. The model is trained and evaluated k times, each time using a different fold as the test set and the remaining folds as the training set. The performance metrics obtained from each fold are then averaged to obtain an overall assessment of the model’s performance.

We start with our standard train/test split.
For example, let’s consider using a 5-fold cross-validation with scikit-learn:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

# Generate synthetic data
np.random.seed(42)
X = np.random.rand(1000, 10)  # Independent variables
important_features = [0, 1, 2, 3]  # Indices of important features
y = np.sum(X[:, important_features], axis=1) + 0.5 * np.random.randn(1000)  # Dependent variable

# Initialize k-fold cross-validation
k = 5
kf = KFold(n_splits=k)

# Initialize a list to store the R2 scores for each fold
r2_scores = []

# Perform k-fold cross-validation
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Train the Ridge regression model
    model = Ridge(alpha=0)  # alpha controls regularization strength; 0 disables it
    model.fit(X_train, y_train)

    # Calculate the R2 score for the current fold
    y_test_pred = model.predict(X_test)
    r2_scores.append(r2_score(y_test, y_test_pred))

# Print the R2 score for each fold and their average
for i, score in enumerate(r2_scores):
    print(f"R2 Score - Fold {i+1}: {score}")
print("Average R2 Score:", np.mean(r2_scores))
  • We initialize a 5-fold cross-validation with KFold(n_splits=k).

  • We iterate over the splits so that each fold serves once as the test set while the remaining folds form the training set. On each iteration, we fit the model to the training set, evaluate it on the test set, and store the evaluation metric in r2_scores.
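The split-fit-score loop above can also be condensed into a single call. As a minimal sketch (not part of the lesson's code), scikit-learn's cross_val_score helper performs the same cycle internally; the cv and scoring arguments below mirror this example's 5-fold setup and R2 metric:

```python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import Ridge

# Same synthetic data as in the lesson's example
np.random.seed(42)
X = np.random.rand(1000, 10)
y = np.sum(X[:, [0, 1, 2, 3]], axis=1) + 0.5 * np.random.randn(1000)

# cross_val_score handles the splitting, fitting, and scoring in one call
scores = cross_val_score(Ridge(alpha=0), X, y, cv=KFold(n_splits=5), scoring="r2")
print("Per-fold R2:", scores)
print("Average R2:", scores.mean())
```

The explicit loop remains useful when you need per-fold artifacts (fitted models, predictions, custom diagnostics); the helper is preferable when only the scores matter.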

In this example, the dataset is split into five folds. The model is trained and evaluated five times, with each fold serving as the test set once. The ...
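The opening of this lesson also mentions leave-one-out cross-validation, the extreme case where k equals the number of samples: each fold's test set is a single observation. A minimal sketch using scikit-learn's LeaveOneOut splitter (shown on a deliberately small synthetic dataset, since LOO fits one model per sample; because each test set has only one point, R2 is undefined per fold, so we average squared errors instead):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import Ridge

# Small synthetic dataset: LOO trains one model per sample
np.random.seed(0)
X = np.random.rand(30, 3)
y = X.sum(axis=1) + 0.1 * np.random.randn(30)

loo = LeaveOneOut()
errors = []
for train_index, test_index in loo.split(X):
    model = Ridge(alpha=0)
    model.fit(X[train_index], y[train_index])
    pred = model.predict(X[test_index])
    errors.append((y[test_index][0] - pred[0]) ** 2)

# One squared error per held-out sample; average them for the overall estimate
print("Number of folds:", len(errors))
print("LOO mean squared error:", np.mean(errors))
```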
