Model selection is a crucial stage in machine learning that focuses on choosing the model and algorithm best suited to a given task. The goal is to obtain a model that produces accurate results, performs well, and fits our requirements, so that we get the expected outputs and use the dataset for its real purpose.
Multiple techniques are used during the model selection process to make the best possible decision at each step and arrive at the best-suited model with a low chance of inaccuracy. These techniques can be grouped by the phase of the model selection process in which they are applied.
Let’s discuss some of the techniques crucial to different process steps.
It is an exhaustive search technique performed over a set of the model's parameter values for hyperparameter tuning. A grid of hyperparameter values is defined, and all the possible combinations of those values are searched for the ones that prove beneficial. The search is independent of past evaluations and depends solely on the combinations that appear in the grid for the given parameters.
Note: The model on which the grid search is applied is also known as the estimator.
Once these combinations are identified, a model is trained and then tested for each combination. The results of all the combinations are compared based on their performance to select the hyperparameter settings that optimize the model's performance.
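As a quick illustration, here is a minimal grid search sketch using scikit-learn's GridSearchCV; the estimator (an SVC), the parameter grid, and the toy dataset below are assumptions chosen only for the example.

# Minimal grid search sketch (the estimator, grid, and data are illustrative assumptions)
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Grid of candidate hyperparameter values; every combination is tried
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

search = GridSearchCV(SVC(), param_grid, cv=5)  # SVC() is the estimator here
search.fit(X, y)

print(search.best_params_, search.best_score_)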
It is a hyperparameter tuning technique that samples a fixed number of parameter settings from the specified distributions. This sampling has two types:
With replacement, if any parameter is given as a distribution
Without replacement, if all parameters are presented as lists
It is important to understand the trade-off in this technique: the fewer the randomly sampled parameter settings, the more efficient the process but the less accurate the optimization results.
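For illustration, here is a minimal random search sketch using scikit-learn's RandomizedSearchCV; the estimator, the sampling distributions, and n_iter below are assumptions for the example.

# Minimal random search sketch (estimator, distributions, and n_iter are illustrative assumptions)
from scipy.stats import uniform
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# "C" is drawn from a continuous distribution (sampled with replacement),
# while "penalty" is given as a plain list
param_distributions = {"C": uniform(0.1, 10), "penalty": ["l2"]}

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions,
    n_iter=10,       # fixed number of sampled parameter settings
    random_state=1,
)
search.fit(X, y)

print(search.best_params_, search.best_score_)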
It is a technique used for hyperparameter tuning that utilizes past evaluations to improve the search speed. It builds a probability model of the objective function and uses it to select the best-suited hyperparameters to evaluate in the true objective function.
Start with 8 sample data points from the true objective function and present them on a graph plot.
Build a surrogate model (the probability representation of the objective function) to get an estimated idea of what the true objective function might look like, and mark the deviations.
Build an acquisition function to decide on the 9th point: identify where the acquisition function is maximized and use that point to mark the 9th parameter in the surrogate model.
Keep repeating the steps until the true objective function is obtained.
How will we know we have the true objective function?
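In practice, we usually stop after a fixed budget of evaluations rather than recovering the true function exactly. To make the steps concrete, here is a minimal sketch of the loop, assuming a Gaussian process surrogate, an upper-confidence-bound acquisition function, and a one-dimensional toy objective; all of these choices are illustrative assumptions rather than a fixed recipe.

# Minimal Bayesian optimization sketch (surrogate, acquisition, and objective are assumptions)
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):                    # toy "true" objective (normally unknown and expensive)
    return np.sin(3 * x) + 0.5 * x

candidates = np.linspace(0, 5, 200).reshape(-1, 1)
X = np.random.RandomState(1).uniform(0, 5, 8).reshape(-1, 1)   # 8 initial sample points
y = objective(X).ravel()

for step in range(10):
    # Surrogate model: a probability representation of the objective function
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5)).fit(X, y)
    mean, std = gp.predict(candidates, return_std=True)

    # Acquisition function (upper confidence bound): evaluate where it is maximized
    acquisition = mean + 2.0 * std
    x_next = candidates[np.argmax(acquisition)].reshape(1, -1)

    # Evaluate the true objective at the chosen point and repeat
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next).ravel())

print("Best point found:", X[np.argmax(y)], "value:", y.max())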
It is a resampling technique that creates dataset partitions for test and training data and makes predictions. One of the commonly used cross-validation techniques is k-fold, which divides the dataset into various small groups referred to as folds. Some of these folds are used as the training dataset, and the rest are reserved as the test dataset. The model is then trained and evaluated separately for each fold.
Note: The number of folds is specified beforehand, e.g., in this case k = 5, where k represents the number of folds.
This lowers the variance in the evaluation, which helps to achieve more accuracy when analyzing the model's performance. Consequently, it makes the most of the available data and yields test results that are more trustworthy than those of a single split. Cross-validation can be computationally intensive because we train and test repeatedly on several subsets, but it helps to reduce the risks of overfitting and underfitting.
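Here is a minimal k-fold cross-validation sketch using scikit-learn; the estimator, the dataset, and k = 5 below are assumptions for the example.

# Minimal k-fold cross-validation sketch (estimator, data, and k are illustrative assumptions)
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

kfold = KFold(n_splits=5, shuffle=True, random_state=1)   # k = 5 folds
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kfold)

print(scores)          # one score per fold
print(scores.mean())   # averaged estimate of the model's performance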
In this technique, we split the dataset into two sub-datasets, i.e., train and test. The purpose of splitting the dataset is to check whether the model performs well on data it was not trained on.
Train set: To train the model to make predictions based on its observations.
Test set: To test the model once it is trained and validated.
Validation set: To measure the performance of each model to select the best one based on the accuracy of results on this set.
Let's take a quick look at the procedure chronologically to understand what happens.
We import train_test_split from the model_selection module of the sklearn library to split the dataset. Then we call train_test_split, passing the columns of the created 2D arrays, the percentage ratios of the training and test sets, and the random seed value as parameters.
import numpy as np
from sklearn.model_selection import train_test_split

# x and y are the feature and label sequences prepared earlier
X = np.array(x).reshape(-1, 1)
Y = np.array(y).reshape(-1, 1)

X_train, X_val, Y_train, Y_val = train_test_split(X, Y, train_size=0.6, test_size=0.2, random_state=1)
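The split above keeps 60% of the data for training and 20% for validation. If a separate test set is also needed, one option is to carve it out of the held-out portion with a second call to train_test_split; the ratios below are assumptions for illustration only.

# One possible way to obtain train, validation, and test sets (ratios are assumptions)
X_train, X_rest, Y_train, Y_rest = train_test_split(X, Y, train_size=0.6, random_state=1)
X_val, X_test, Y_val, Y_test = train_test_split(X_rest, Y_rest, test_size=0.5, random_state=1)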
In this technique, we compare the overall performance of the candidate models based on the parameters involved. Let's briefly discuss a few development-based parameters that are compared to select a model that runs efficiently in production and has a longer lifetime.
It can be defined as the number of test cases that are correctly classified divided by the total test cases. It can be applied to generic problems that have a balanced dataset.
However, if the dataset is imbalanced, for example, if the ratio of fault occurrences to no-fault occurrences is 1:99, then accuracy becomes misleading: a model that always predicts no fault scores 99% while never detecting the 1% of faults.
It is a measure of the correctness of the classified dataset. Considering the positive cases, we can say that it is the ratio of the correctly classified positive cases to the total classified positive cases.
The greater the fraction, the higher the precision and, consequently, the higher the probability of correct classification. A model with a good probability of correctly classifying the positive cases is considered a good model.
It can be defined as the harmonic mean of precision and recall, and it is used to balance the strengths of the two in cases where both precision and recall are needed to draw conclusions.
For example, in repairing crucial medical equipment, precision helps to save on the company's cost by identifying the exact repair points, and recall helps to ensure that the machinery is stable and not a threat to human lives.
It can be defined as the rate of correctly classified positive cases (the true positive rate) plotted against the rate of negative cases incorrectly classified as positive (the false positive rate). We plot a ROC curve to present this relation, and the area under the obtained curve (AUC) can determine the model's performance.
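As a quick illustration, here is a minimal sketch that computes these metrics with scikit-learn; the labels, predictions, and scores below are made-up assumptions for the example.

# Minimal metrics sketch (the labels, predictions, and scores are made-up assumptions)
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]                     # actual classes
y_pred  = [0, 1, 1, 1, 0, 0, 1, 0]                     # predicted classes
y_score = [0.2, 0.6, 0.9, 0.8, 0.4, 0.1, 0.7, 0.3]     # predicted probabilities for ROC-AUC

print("Accuracy :", accuracy_score(y_true, y_pred))    # correct cases / total cases
print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F1 score :", f1_score(y_true, y_pred))          # harmonic mean of precision and recall
print("ROC-AUC  :", roc_auc_score(y_true, y_score))    # area under the ROC curve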
Various techniques help select a model, such as grid search, random search, Bayesian optimization, cross-validation, the train-test split, and model performance comparison. Using these techniques effectively makes it possible to examine a model thoroughly, tune its hyperparameters, and compare the results of different candidate models to find the best fit.