Machine learning is a field of artificial intelligence that empowers computers to learn patterns from data, and thus make predictions, without being explicitly programmed. Machine learning algorithms (also called estimators or models) identify insights and trends in the data by processing it iteratively, which refines their performance and predictions.
When we approach a machine learning task, the first and foremost step is selecting an appropriate estimator. There is a variety of estimators available, like decision trees, support vector machines, neural networks, and ensemble methods. Choosing the right estimator depends on many factors, including the data size, feature complexity, and the nature of the problem. The major problem types we encounter in machine learning tasks are classification, regression, clustering, and dimensionality reduction.
Scikit-learn's documentation provides a complete flowchart for choosing an estimator for a machine learning task. It poses a series of questions about the data and the nature of the problem that ultimately lead us to the right estimator for our task. The scikit-learn machine learning model cheat sheet is given below:
For classification, the dataset should have more than 50 samples, and the data should be labeled. If the sample data has more than 100K entries, we may choose the SGD classifier. If the SGD classifier does not return satisfactory results, we may move toward kernel approximation.
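Below is a minimal sketch of this branch, assuming a synthetic dataset as a stand-in for a large labeled corpus; the sizes and parameters are illustrative, not prescribed by the cheat sheet:

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a large labeled dataset (>100K samples).
X, y = make_classification(n_samples=150_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# First try: a plain linear SGD classifier.
sgd = SGDClassifier(loss="hinge", random_state=0).fit(X_train, y_train)
print("SGDClassifier accuracy:", sgd.score(X_test, y_test))

# Fallback: map features through an approximate RBF kernel (Nystroem)
# and train the same linear classifier on top of the new features.
kernel_sgd = make_pipeline(
    Nystroem(n_components=100, random_state=0),
    SGDClassifier(loss="hinge", random_state=0),
).fit(X_train, y_train)
print("Kernel approximation accuracy:", kernel_sgd.score(X_test, y_test))
```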
For data with fewer than 100K samples, Linear SVC can do the classification job. However, if we have textual data, Linear SVC may not give us the required accuracy, and we may choose Naive Bayes as our estimator. If the data is not textual, then the KNeighbors Classifier is the better option.
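A small sketch of this branch follows; the iris data and the tiny spam/ham corpus are toy stand-ins chosen purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Numeric data with fewer than 100K samples: try LinearSVC first.
X, y = load_iris(return_X_y=True)
print("LinearSVC:", LinearSVC().fit(X, y).score(X, y))

# Textual data: Naive Bayes over bag-of-words counts.
docs = ["free prize now", "meeting at noon", "win cash prize", "lunch at noon"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham (toy labels)
text_clf = make_pipeline(CountVectorizer(), MultinomialNB()).fit(docs, labels)
print("Naive Bayes:", text_clf.predict(["cash prize now"]))

# Non-textual data where LinearSVC underperforms: KNeighborsClassifier.
print("KNeighbors:", KNeighborsClassifier(n_neighbors=3).fit(X, y).score(X, y))
```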
For regression, the dataset should have more than 50 samples, and the machine learning task should be to predict a quantity. For sample data with more than 100K entries, the SGD Regressor is the right estimator.
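As a hedged sketch (again on synthetic data of an illustrative size), the SGD Regressor is typically paired with feature scaling, since SGD-based estimators are sensitive to it:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a large regression dataset (>100K samples).
X, y = make_regression(n_samples=150_000, n_features=20, noise=5.0, random_state=0)

# Scale features, then fit a linear model by stochastic gradient descent.
reg = make_pipeline(StandardScaler(), SGDRegressor(random_state=0)).fit(X, y)
print("R^2 score:", reg.score(X, y))
```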
On the other hand, if the data has fewer than 100K entries and only a few features have a major impact on the predictions, then the Lasso or ElasticNet estimators are used; otherwise, Ridge Regression is used. If Ridge Regression does not predict accurately, then we may use ensemble methods.
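The sketch below compares these estimators on a synthetic problem where only a handful of features are informative; the dataset and hyperparameters are assumptions made for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# Only 5 of 50 features carry signal, mimicking "few features matter".
X, y = make_regression(n_samples=5_000, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

# Sparse linear models (Lasso, ElasticNet) versus Ridge Regression.
for est in (Lasso(alpha=1.0), ElasticNet(alpha=1.0), Ridge(alpha=1.0)):
    print(type(est).__name__, round(est.fit(X, y).score(X, y), 3))

# Fallback if Ridge underperforms: an ensemble method.
forest = RandomForestRegressor(random_state=0).fit(X, y)
print("RandomForestRegressor", round(forest.score(X, y), 3))
```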
For machine learning problems that require grouping data into categories when the dataset does not contain labels, we move toward clustering techniques to solve the problem.
If the number of categories is known and the data sample has fewer than 10K entries, then we choose the KMeans estimator. Spectral Clustering and GMM (Gaussian mixture model) can be used if KMeans does not give the desired output.
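Here is a minimal sketch of that branch on synthetic blobs (the data and cluster count are illustrative):

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data with a known number of clusters (<10K samples).
X, _ = make_blobs(n_samples=1_000, centers=3, random_state=0)

# First choice: KMeans.
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Fallbacks if KMeans does not separate the data well:
sc_labels = SpectralClustering(n_clusters=3, random_state=0).fit_predict(X)
gmm_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X)
print(km_labels[:10], sc_labels[:10], gmm_labels[:10])
```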
If the data sample is greater than 10K entries, the MiniBatchKMeans estimator is used instead. When the number of categories is not known and the sample has fewer than 10K entries, the MeanShift and VBGMM (variational Bayesian Gaussian mixture model) estimators can be trained.
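A hedged sketch of the unknown-category branch follows. Note that VBGMM from the original cheat sheet lives on in current scikit-learn as BayesianGaussianMixture; that mapping is an assumption on our part, and the data is synthetic:

```python
from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture

# Synthetic data where the number of clusters is treated as unknown.
X, _ = make_blobs(n_samples=2_000, centers=4, random_state=0)

# MeanShift infers the number of clusters from the data itself.
ms = MeanShift().fit(X)
print("MeanShift found", len(ms.cluster_centers_), "clusters")

# BayesianGaussianMixture starts from an upper bound on components
# and drives the weights of unneeded ones toward zero.
bgm = BayesianGaussianMixture(n_components=10, random_state=0).fit(X)
print("Active components:", int((bgm.weights_ > 0.01).sum()))
```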
If we do not want to predict a category or a quantity, then we move toward the dimensionality reduction category and use the Randomized PCA estimator.
If Randomized PCA does not work, we may check the dataset size. For sample data with fewer than 10K samples, we may train the Isomap, Spectral Embedding, and LLE estimators.
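The sketch below walks this branch on the small digits dataset (a choice made for illustration). Randomized PCA from the cheat sheet is available in current scikit-learn as PCA with svd_solver="randomized":

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, LocallyLinearEmbedding, SpectralEmbedding

# Small dataset: 1,797 samples, 64 features.
X, _ = load_digits(return_X_y=True)

# First choice: randomized PCA down to 2 components.
X_pca = PCA(n_components=2, svd_solver="randomized", random_state=0).fit_transform(X)

# Fallbacks for fewer than 10K samples: manifold-learning estimators.
X_iso = Isomap(n_components=2).fit_transform(X)
X_spec = SpectralEmbedding(n_components=2, random_state=0).fit_transform(X)
X_lle = LocallyLinearEmbedding(n_components=2, random_state=0).fit_transform(X)
print(X_pca.shape, X_iso.shape, X_spec.shape, X_lle.shape)
```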
To conclude, we have explored the scikit-learn cheat sheet, an invaluable resource for choosing the right estimator. The comprehensive flowchart simplifies selection: by answering a few questions about the data and the nature of the problem, we quickly arrive at a suitable estimator for our task.