How to implement cross_val_score in sklearn

Scikit-learn is an open-source Python machine-learning library that provides tools for data preprocessing, modeling, and evaluation. An essential step in building a robust machine learning model is evaluating how well it generalizes to unseen data. We can evaluate a model on multiple train-test splits of the data using Scikit-learn's cross_val_score function. In this Answer, we'll explore cross_val_score and its step-by-step implementation.

Understanding cross_val_score

cross_val_score is a function that computes cross-validated scores for an estimator. It splits the data set into multiple subsets (folds) of training and testing data, trains the model on each training subset, evaluates it on the corresponding testing subset, and returns one score per split. The process repeats once for each split, as determined by the cv parameter.
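The split, train, and score loop described above can be sketched by hand. This is only an illustration of the idea, not scikit-learn's internal code: the synthetic data from make_regression, the Ridge model, and the unshuffled 5-fold KFold splitter are assumptions chosen so the sketch runs offline.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (an assumption, used so the sketch needs no download).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

model = Ridge()
kfold = KFold(n_splits=5)  # 5 consecutive folds, no shuffling

manual_scores = []
for train_idx, test_idx in kfold.split(X):
    fold_model = clone(model)                    # fresh, unfitted copy for each fold
    fold_model.fit(X[train_idx], y[train_idx])   # train on the training subset
    # Score on the held-out subset (R^2 for a regressor's default scorer).
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

manual_scores = np.array(manual_scores)
library_scores = cross_val_score(model, X, y, cv=kfold)

print(manual_scores)
print(library_scores)
```

Both arrays hold one score per fold, and with the same splitter they should agree.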

Because the model is evaluated on several different splits, cross_val_score gives a better picture of the model's behavior and weaknesses than a single split does. The illustration below shows how cross_val_score splits the data set into training and testing data if the number of cross-validations is set to 5.

5 train test splits of the same data using cross_val_score
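To make the five splits concrete, the following sketch prints the training and testing indices that an unshuffled 5-fold split produces on a toy array of 10 samples. The toy array and the choice of KFold are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples (an assumption, just to keep the printed indices readable).
X_demo = np.arange(10).reshape(-1, 1)

kfold = KFold(n_splits=5)
splits = list(kfold.split(X_demo))

for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Each sample lands in the testing subset exactly once and in the training subset in the remaining four splits.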

Syntax

The syntax to use cross_val_score is:

cross_val_score(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, error_score)
cross_val_score syntax
  • estimator: The object that implements ‘fit’ and ‘predict’.

  • X: The features data array to fit.

  • y: The target array for prediction and training. Default = None.

  • groups: An array of group labels for the samples, used when dividing the data set into training and testing sets. It only takes effect in combination with a group-based cv splitter (e.g., GroupKFold).

  • scoring: By default, its value is None. In this case, the default scorer of the estimator is used (accuracy for classifiers, R² for regressors). Otherwise, we can pass a string that tells which scoring option (accuracy, precision, recall, f1, roc_auc, etc.) to use.

  • cv: Determines the cross-validation splitting strategy. It can be an integer giving the number of folds, a cross-validation splitter object (e.g., KFold), or an iterable of train-test index splits. Default = None, which uses 5-fold cross-validation.

  • n_jobs: The number of jobs we want to run in parallel. None means 1 unless the call is made in a joblib.parallel_backend context. -1 means that we want all processors to be used.

  • verbose: It sets the verbosity level. Default = 0.

  • fit_params: A dictionary of parameters to be passed to the estimator's fit method. Recent scikit-learn releases deprecate it in favor of the params argument.

  • pre_dispatch: By default, its value is 2*n_jobs. It controls the number of jobs that are dispatched during parallel execution. By decreasing this quantity, we can prevent excessive memory usage caused by dispatching more jobs than the available CPUs can process.

  • error_score: By default, its value is np.nan. It is the value that is assigned to the score if there is an error in estimator fitting. If its value is set to 'raise', the error is raised. If a numeric value is set, a FitFailedWarning is issued instead.
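As a rough sketch of how a few of these parameters combine, the example below passes a scoring string, a splitter object for cv, and group labels. The synthetic data, the group assignment (6 groups of 20 samples), and the specific scorer are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

# Synthetic data (an assumption, so the sketch runs without a download).
X, y = make_regression(n_samples=120, n_features=4, noise=5.0, random_state=0)
model = Ridge()

# scoring as a string, cv as a splitter object with shuffling.
shuffled_cv = KFold(n_splits=4, shuffle=True, random_state=0)
neg_mse = cross_val_score(
    model, X, y, scoring="neg_mean_squared_error", cv=shuffled_cv
)

# groups only matters with a group-based splitter such as GroupKFold.
groups = np.repeat(np.arange(6), 20)  # hypothetical group labels
group_scores = cross_val_score(
    model, X, y, groups=groups, cv=GroupKFold(n_splits=3)
)

print(neg_mse)       # negated MSE per fold, so values are <= 0
print(group_scores)  # default R^2 per fold, no group split across train and test
```

Error metrics are exposed in negated form (neg_mean_squared_error) so that a higher score is always better.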

Steps to implement cross_val_score

Now that we understand cross_val_score, let's walk through the steps for its implementation:

1. Import the necessary libraries

Before we can use cross_val_score, we need to import the required libraries from sklearn:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge
import numpy as np
import pandas as pd
Importing the necessary libraries

We have imported cross_val_score from sklearn's model_selection module. We will be using Ridge regression in this example.

2. Load and prepare the data

Now we will load the data on which we want to train and evaluate our machine learning model. For that, we will fetch the California housing data set from sklearn.

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
housing_sk_data = fetch_california_housing()
housing_df = pd.DataFrame(housing_sk_data["data"], columns=housing_sk_data["feature_names"])
housing_df["target"] = housing_sk_data["target"]
x = housing_df.drop("target", axis=1)
y = housing_df["target"]
Loading and preparing the data

After importing the data, we prepare our feature matrix (x) and target vector (y).

3. Create an estimator

We will instantiate the machine learning model we want to use. As said earlier, we'll use a Ridge regression model:

model = Ridge()
Creating an estimator

4. Generate cross-validated scores

Now, we can use the cross_val_score function to generate a cross-validated score for each split. Because Ridge is a regressor, its default scorer is the R² coefficient of determination rather than accuracy:

cross_val_scores = cross_val_score(model, x, y, cv=5)
Computing the cross-validated scores

We set cv to 5, which means that the model will be trained and tested on 5 different splits of the dataset.
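For a regressor such as Ridge, an integer cv should behave like an unshuffled KFold splitter with the same number of folds. The sketch below checks this on synthetic data, which is an assumption made so it runs without downloading the housing data set.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the housing data (an assumption).
X, y = make_regression(n_samples=150, n_features=6, noise=8.0, random_state=1)

scores_int = cross_val_score(Ridge(), X, y, cv=5)
scores_kfold = cross_val_score(Ridge(), X, y, cv=KFold(n_splits=5))

print(scores_int)
print(scores_kfold)
```

Passing a splitter object instead of an integer is useful when we want shuffling or a fixed random_state.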

5. Analyze the scores

Now that we have trained and tested our model using cross_val_score, we can analyze the output of the function, which is an array with one score per split. It helps us understand the model's performance better. For instance, we can identify splits where the model performs well or poorly, and summarize the scores by their mean and spread.
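One simple way to analyze the scores is to look at their mean, their standard deviation, and the weakest split. The sketch below does this on synthetic data (an assumption); the variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data (an assumption, used in place of the housing data).
X, y = make_regression(n_samples=200, n_features=5, noise=15.0, random_state=2)

scores = cross_val_score(Ridge(), X, y, cv=5)

mean_score = scores.mean()            # typical performance across splits
std_score = scores.std()              # how much performance varies between splits
worst_fold = int(np.argmin(scores))   # index of the weakest split

print(f"R^2 per split: {np.round(scores, 3)}")
print(f"Mean: {mean_score:.3f}, standard deviation: {std_score:.3f}")
print(f"Weakest split: {worst_fold}")
```

A large standard deviation suggests that the model's performance depends heavily on which samples end up in the testing subset.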

Complete code

The complete code is shown below:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing

housing_sk_data = fetch_california_housing()
housing_df = pd.DataFrame(housing_sk_data["data"], columns=housing_sk_data["feature_names"])
housing_df["target"] = housing_sk_data["target"]

x = housing_df.drop("target", axis=1)
y = housing_df["target"]

model = Ridge()

cross_val_scores = cross_val_score(model, x, y, cv=5)

print(cross_val_scores)
Complete code to perform cross_val_score

Benefits of cross_val_score

cross_val_score offers several advantages:

  • Insight into model performance: By obtaining a score for each split, we can analyze our model in detail to explore where it works fine and where it struggles.

  • Data efficiency: Each data point is used for both training and testing across the splits, which maximizes the dataset's use.

  • Effective evaluation: We can assess the model's performance more accurately compared to a single train-test split.
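The difference from a single train-test split can be sketched as follows. The synthetic data, the 80/20 split, and the random_state values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data (an assumption, so the sketch runs offline).
X, y = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=3)

# A single split yields one score that depends on which rows were held out.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
single_score = Ridge().fit(X_train, y_train).score(X_test, y_test)

# Cross-validation yields one score per split, so we also see the variation.
cv_scores = cross_val_score(Ridge(), X, y, cv=5)

print(f"Single split R^2: {single_score:.3f}")
print(f"Cross-validated R^2: mean={cv_scores.mean():.3f}, std={cv_scores.std():.3f}")
```

The single score is one draw from a range of possible outcomes, while the cross-validated scores show that range directly.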

Conclusion

The cross_val_score function provided by sklearn is a powerful tool for evaluating machine learning models using cross-validated scores. By following the steps explained in this Answer, we can implement cross_val_score. This gives us insight into the model's behavior across different data subsets so that we can improve it.

Copyright ©2024 Educative, Inc. All rights reserved