When preparing for a machine learning interview, many technical and nontechnical aspects must be considered. To cover the technical aspects of a machine learning interview, we need to review the fundamentals and consider how machine learning is applied in practice.
In this blog, we provide the top 20 machine learning questions (with answers). These questions were selected to help you practice your ML skills in a real-world interview setting, and to gauge your depth of understanding of fundamental machine learning concepts.
Note: these interview questions are intended for entry-level positions. For questions on machine learning system design and the decision-making required in actual use cases, we recommend checking out the machine learning resources included at the end of this blog.
Understanding the problem: Understanding the nature of the problem and the desired objective is an important first step that will influence decisions like the metrics used for evaluation, the choice of algorithm and the choice of loss function.
Data preparation: Depending on the dataset, data preparation may involve some or all of the standard steps: cleaning the data, handling missing values and outliers, encoding categorical features, and scaling or normalizing numerical values.
Data splitting: Dividing the data into training, validation, and testing sets (a code sketch of this step follows this list).
Training the model: The machine learning technique and algorithm are selected based on the problem and the data. The model is trained on the training data and refined during evaluation.
Evaluation: Using appropriate performance metrics and validation techniques, the model is validated on the validation data and assessed for flaws. Training is repeated until the model stops improving. The final model is evaluated on the test dataset; if it doesn't perform well there, the model may have overfitted to the validation data. At this stage, other cross-validation techniques can be considered, hyperparameters can be fine-tuned, and regularization can be applied. For neural networks, adding dropout or changing the number of layers may help. The overfitting may also stem from how the data was split, in which case the split can be reconsidered and the process restarted from scratch.
Deployment: The model is deployed after being assessed as successful. The deployed model is monitored continuously and evaluated. This may result in retraining or fine-tuning the model.
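To make the data-splitting step above concrete, here is a minimal sketch using scikit-learn; the 60/20/20 train/validation/test ratio is an illustrative choice, not a rule:

```python
# A minimal sketch of the data-splitting step using scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve out the test set, then split the rest into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 90 30 30
```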
Clustering algorithms are typically used for unsupervised learning (where the learning of patterns in the data occurs without labels assigned to it). These algorithms group similar data points into groups called clusters. Some common clustering algorithms are as follows:
k-means clustering: In k-means clustering, the number of clusters, k, is chosen at the outset, and k data points are randomly chosen to serve as the initial cluster centers. The remaining data points are assigned to the closest cluster. For the data points in each cluster, the mean is computed to find the actual center of that cluster. If the actual cluster centers differ from the current ones, the entire process is repeated with these new centers. This continues until there are no more changes in the cluster centers or the changes are insignificant (the difference between the old and new centers of each cluster is below a certain threshold). A code sketch follows after this list.
DBSCAN: DBSCAN identifies clusters based on a notion of density: points that lie close together are grouped. As a result, the clusters don't necessarily come out as circles (or hyperspheres). A group of points must contain more points than a specified minimum to qualify as a cluster, and points that do not fall into any cluster are considered noise. Points with enough neighbors are deemed core points; border points that are not core may be reachable from more than one cluster.
Hierarchical clustering: This consists of two methods, agglomerative and divisive clustering, which can be considered bottom-up and top-down versions of the same idea. Agglomerative clustering starts with each data point as its own cluster and repeatedly merges the closest pair of clusters; divisive clustering starts with all points in a single cluster and repeatedly splits it.
The terminating condition may be the number of clusters or the minimum size of a cluster.
The term “hierarchical” stems from the fact that the clusters formed at each step are organized into a hierarchical tree-like structure. This hierarchy is visualized through a dendrogram, where the height of the horizontal lines shows the distance between the combined clusters.
Sketching a horizontal line across the dendrogram allows us to choose the clusters based on the level of granularity required.
Note: As opposed to k-means clustering, the number of clusters doesn't need to be specified for either DBSCAN or hierarchical clustering.
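As a quick illustration of k-means in practice, here is a minimal sketch using scikit-learn; the three-blob toy data and the choice n_clusters=3 are assumptions made for the example:

```python
# A minimal k-means sketch with scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy data: three blobs around different centers.
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print(kmeans.cluster_centers_)  # learned centers, close to (0,0), (5,5), (0,5)
print(kmeans.labels_[:5])       # cluster assignment of the first five points
```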
Bias and variance are measures used to indicate two different kinds of error in the model.
Bias: Bias represents the gap between the predicted mean and the actual target values. A high bias indicates the model has not learned the patterns in the data well enough to make predictions, and implies an underfitting of the model to the data.
Variance: Variance represents how much we expect the predicted value to change as we vary the dataset. A high variance indicates that the model has learned noise—patterns from the data that are not present in the real-world instances. In other words, it represents an overfitting of the model to the data.
We desire models with low bias and low variance. When faced with a model that exhibits low bias and high variance vs. another that exhibits high bias and low variance, a practical alternative that works well is to use cross-validation to see which one is performing better. We could also use the mean squared error (MSE), which is a function of both measures:

$\text{MSE} = \text{Bias}^2 + \text{Variance} + \text{Irreducible error}$
The bias term is squared because bias can be negative (a square is used instead of an absolute value so that the term is on the same scale as the variance, which is likewise the square of the standard deviation). Mean squared error is also used as a loss function to be minimized by the learning algorithm.
The features in a dataset can be thought of as the coordinates of a hyperspace. PCA (principal component analysis) is a technique that involves applying an orthogonal linear transformation to the data, yielding a new coordinate system whose axes, called the principal components, point in the directions along which the data varies the most.
A consequence of the way these principal components are constructed is that the transformed data has more spread around the first principal component (i.e., a larger variance along it) than around the second one, which in turn has a larger variance along it than the third one, and so on. Because of this, very little information is contained along the last few principal components, and these can be dropped to achieve dimensionality reduction.
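Here is a minimal PCA sketch with scikit-learn, projecting the 4-dimensional iris dataset onto its first two principal components (keeping two components is an arbitrary choice for the example):

```python
# A minimal PCA sketch using scikit-learn.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (150, 2): dimensionality reduced from 4 to 2
print(pca.explained_variance_ratio_)  # the first component carries the most variance
```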
Logistic regression is appropriate where the objective is to classify data into binary categories (Yes or No). It’s effective when the class boundaries are separable by a linear function.
On the other hand, SVM (support vector machine) is better at identifying class boundaries than logistic regression. The data points that lie closest to the class boundaries are called support vectors. SVM finds a decision boundary so that the distance of the boundary from the closest support vectors on each side is maximized. With a suitable choice of a kernel (a function), the input data can also be transformed so that the decision boundaries are non-linear. For this reason, it’s a better fit for dealing with high-dimensional data.
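To illustrate the difference, here is a hedged sketch comparing logistic regression with an RBF-kernel SVM on data whose class boundary is non-linear (concentric circles); the dataset and parameters are made up for the example:

```python
# Comparing a linear classifier with a kernel SVM on a non-linear boundary.
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)

linear_model = LogisticRegression().fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)

# The linear model struggles on this boundary; the kernel SVM does not.
print("logistic regression accuracy:", linear_model.score(X, y))
print("RBF SVM accuracy:", rbf_svm.score(X, y))
```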
kNN (the k-nearest neighbors algorithm) is a supervised classification algorithm where some of the points are labeled in the training phase. It proceeds as follows:
The distances of all labeled points from all unlabeled points are computed.
For each unlabeled point, the k closest labeled data points are considered.
Each unlabeled point is classified by the label that appears on the majority of these k nearest neighbors.
The choice of k must be made carefully, as it can lead to overfitting or underfitting: a small k makes predictions sensitive to noise, while a large k over-smooths the decision boundary. A code sketch follows below.
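Here is a minimal kNN sketch using scikit-learn; k=5 is an illustrative choice, and in practice k would be tuned (e.g., via cross-validation):

```python
# A minimal k-nearest-neighbors sketch with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(knn.score(X_test, y_test))  # fraction of test points labeled correctly
```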
Decision trees are primarily used for solving classification problems. Each internal node in a decision tree represents a decision or a question about a feature value. The branches coming out of an internal node represent the possible answers to that question. A leaf node corresponds to a classification category or a prediction.
To classify a data point, the tree is traversed starting at the root. At each internal node of the tree, the correct branch is taken depending on the feature value on which the decision is made. The data sample is classified once a leaf node is reached.
The choice of the feature on which each decision node splits impacts the classification's effectiveness. Multiple algorithms exist for creating a decision tree.
In decision trees, a decision node can be considered representative of a subset of data that is to be split further based on some criteria. Gini impurity is a metric for measuring the class imbalance produced at a child node due to a decision or a split. It is given by the following formula:

$G = 1 - \sum_{i=1}^{K} p_i^2$

Here, $p_i$ is the probability that a randomly selected data sample belongs to the $i$-th class, and $K$ is the number of classes. The probabilities are calculated based on the class distribution of the data points at that node.
Gini impurity is useful for decision tree algorithms, where a feature and its value must be selected at each step of the process to create the next decision node. To do this effectively, the Gini impurity of each child node produced by a candidate decision is computed, and the results are combined into a weighted average. The decision associated with the minimum weighted Gini impurity is then used as the decision node.
To grasp this, it’s best to consider a small example. Suppose we’re given a problem with two classes and the following dataset:
| Color | Class |
|-------|-------|
| blue  | yes   |
| red   | yes   |
| blue  | yes   |
| blue  | no    |
| red   | yes   |
| gray  | yes   |
| gray  | no    |
If we are considering a decision node for the decision “color = blue?”, the Gini impurity for the left child is calculated by looking at the rows where the color value is blue (two yes, one no):

$G_{\text{left}} = 1 - \left(\tfrac{2}{3}\right)^2 - \left(\tfrac{1}{3}\right)^2 = \tfrac{4}{9} \approx 0.44$

For the right child, we consider the rows that have the color values red and gray (three yes, one no):

$G_{\text{right}} = 1 - \left(\tfrac{3}{4}\right)^2 - \left(\tfrac{1}{4}\right)^2 = \tfrac{3}{8} \approx 0.38$

The number of instances is not the same for both children, so we take a weighted average as a measure of how good the decision is:

$G_{\text{weighted}} = \tfrac{3}{7} \cdot \tfrac{4}{9} + \tfrac{4}{7} \cdot \tfrac{3}{8} \approx 0.40$
We can similarly evaluate the decisions for other values of the color feature. For the decisions “color = gray?” and “color = red?”, the weighted Gini impurities are 0.37 and 0.34, respectively. So we pick “color = red?” as the decision node, as it results in the split with the smallest weighted Gini impurity, i.e., the least impure split. If there were other features in the dataset, we would have also considered those in selecting the least impure split.
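To check the arithmetic, here is a small sketch that recomputes the weighted Gini impurity of the “color = blue?” split from the table above:

```python
# Recomputing the worked Gini-impurity example in plain Python.
def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_i^2)."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def weighted_gini(left, right):
    """Weighted average of the children's impurities."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

colors = ["blue", "red", "blue", "blue", "red", "gray", "gray"]
classes = ["yes", "yes", "yes", "no", "yes", "yes", "no"]

left = [c for col, c in zip(colors, classes) if col == "blue"]
right = [c for col, c in zip(colors, classes) if col != "blue"]
print(round(weighted_gini(left, right), 2))  # 0.4, matching the calculation above
```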
Individual machine learning models may be prone to making erroneous predictions. Ensemble learning combines multiple weaker machine learning models to yield a model with improved predictive power.
There are different types of ensemble learning. For example:
Boosting is an ensemble learning technique where multiple weak learners are trained sequentially. The predictions from each model are validated, and the incorrect predictions are used to control how the next model is built. In this manner, subsequent models are built and tested until a termination condition is met, such as the model's performance failing to improve or a limit on the number of models being reached. The predictions from the intermediate models then contribute to the final prediction, based on majority votes or averages. Some well-known boosting algorithms are AdaBoost, gradient boosting, and XGBoost.
Bagging involves training multiple model instances on different, randomly chosen subsets of the data in parallel. Random forests are a well-known example of bagging in which small decision trees are used as models. The predictions from these models are then aggregated: the classification outcome is based on how most decision trees classify each point. For numerical data, the average of the predicted outcomes is used.
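As a concrete example of bagging, here is a minimal random forest sketch using scikit-learn; the number and depth of the trees are illustrative choices:

```python
# A minimal bagging sketch: a random forest of small decision trees.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 shallow trees; each classification is decided by majority vote.
forest = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))
```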
Convolutional neural networks (CNNs) can be considered a generalization of a traditional neural network in which additional preprocessing layers are added. They work particularly well on data where the local context is important, particularly for images and signals, but also for natural language processing.
The main building blocks of a CNN, for 2D data such as an image, are as follows (for other kinds of data, such as sequential data, similar ideas apply):
Convolution layer: A small filter (kernel) is slid across the image, computing a weighted sum of the values under it at each step. The result, called a feature map, highlights local patterns such as edges.
Pooling layer: The convolved image (the feature map) is compressed by sliding a window over it and taking (typically) the maximum or the average of the values at each step.
Multiple convolution and pooling layers help with feature extraction and reduction of size without significant loss of information.
Flattening layer: The data is flattened into one dimension. This can be done, for instance, by aligning all the columns sequentially into a single column.
Fully connected layer: The data is fed into a traditional fully connected neural network. Each node in the output layer of the fully connected neural network represents a prediction.
The highly specialized layers that precede the fully connected layer distinguish a CNN from an ANN (Artificial Neural Network). Because of these, data can be fed into a CNN with very little preprocessing. Another interesting way a CNN differs from an ANN is that sliding a small filter over the image implies that the same weights are shared on multiple connections, resulting in a smaller number of parameters. CNNs work better than traditional neural networks on spatial data, making them better suited for images, videos, time-series data (temporal data), and natural language.
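To tie the layers together, here is a minimal CNN sketch using the Keras API; the input shape and layer sizes are illustrative assumptions, not tuned values:

```python
# A minimal CNN sketch showing the layer types described above.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),                       # e.g., a grayscale image
    layers.Conv2D(16, kernel_size=3, activation="relu"),  # convolution layer
    layers.MaxPooling2D(pool_size=2),                     # pooling layer
    layers.Flatten(),                                     # flattening layer
    layers.Dense(10, activation="softmax"),               # fully connected layer
])
model.summary()
```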
Sensitivity and specificity are metrics that measure the fraction of predictions that are correctly classified. Both metrics are defined as follows using the notations TP and FP to indicate the number of true positives and false positives and the notations TN and FN to represent the number of true negatives and false negatives:
Sensitivity is the fraction of actual positive instances that are correctly identified:

$\text{Sensitivity} = \frac{TP}{TP + FN}$

In other words, sensitivity is the true-positive rate. Sensitivity is also called recall.
Specificity is the fraction of actual negative instances that are correctly identified:

$\text{Specificity} = \frac{TN}{TN + FP}$

Note that $1 - \text{Specificity}$ is the false-positive rate:

$1 - \text{Specificity} = \frac{FP}{FP + TN}$
A confusion matrix has rows and columns labeled by the classification categories. The row labels represent predictions and the column labels represent the actual classes (the other way round also works).
Each cell contains the number of samples that are classified as indicated by the row label, but actually belong to the class given by the column.
The diagonal entries of the matrix represent the values that were correctly classified. The off-diagonal entries are the errors, called type I (FP) and type II (FN) errors. The individual entries are of interest for many other reasons: for example, they can be combined into metrics such as accuracy, precision, and recall.
In short, a confusion matrix enables us to quickly calculate different metrics and helps compare two models built against different thresholds.
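Here is a small sketch that computes sensitivity and specificity from a binary confusion matrix using scikit-learn; the labels are made up for the example:

```python
# Deriving sensitivity and specificity from a binary confusion matrix.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]  # made-up ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]  # made-up predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("sensitivity (recall):", tp / (tp + fn))
print("specificity:", tn / (tn + fp))
```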
As the classification threshold for a classification algorithm changes, the number of true positives and false positives also changes. To see which threshold works best, an ROC curve (Receiver Operating Characteristic curve) is a useful way to visualize how accurately a classifier makes positive predictions as changes are made to its classification threshold.
The ROC curve is drawn with the horizontal axis representing the false-positive rate and the vertical axis representing the true-positive rate. A point $(x, y)$ on the ROC curve represents a classification threshold that resulted in a false-positive rate of $x$ and a true-positive rate of $y$.
If the ROC curve lies along the 45-degree diagonal, it implies that the classification at all thresholds is no better than a random coin flip. An ROC curve below this line indicates performance worse than random, and we need to reevaluate how we assign classes to the data points.
Note that the ROC curve for a classifier is monotonically increasing. In other words, an increase in the threshold causes a decrease in both the number of true positives and false positives, and a decrease in the threshold leads to an increase in both.
The point on the ROC curve closest to the top-left corner is usually a good choice of threshold when the correct and incorrect classification of the positives are considered equally important. But that may not be a good choice if the importance assigned to the true-positive and false-positive rates is skewed. For example, in cancer detection, finding true positives is more important for early intervention. A higher false-positive rate may be stressful for healthy individuals flagged as cancer patients, but subsequent testing can help relieve that stress. In such a case, we might tilt toward choosing a threshold corresponding to a higher true-positive rate at the cost of a higher false-positive rate.
AUC stands for the area under the curve. It can be proved that the area under the ROC curve (for a given classifier) equals the probability that the classifier assigns a higher score to a randomly chosen instance from the positive class than to a randomly chosen instance from the negative class. In other words, it measures how well the classifier separates positive instances from negative ones.
Since this numeric value does not depend on a single threshold and gives information about the classifier’s overall quality, it is a good measure for comparing classifiers and far more useful than dealing with multiple confusion matrices against each threshold.
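Here is a minimal sketch computing the ROC curve and AUC with scikit-learn; the synthetic dataset and logistic regression model are placeholder choices:

```python
# Computing an ROC curve and its AUC with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)  # one (fpr, tpr) per threshold
print("AUC:", roc_auc_score(y_test, scores))
```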
One-hot encoding transforms categorical (textual) features into numerical ones. This is done by making each feature value a distinct feature with a binary 0 or 1 value. Here are the pros:
The encoded data is numeric, so it can be consumed by models that require numerical input.
No artificial ordering is imposed on the category values (unlike label encoding, which maps them to integers).
The encoding also has some cons:
The number of columns grows with the number of distinct category values, which can blow up dimensionality for high-cardinality features.
The resulting data is sparse, and the new columns are correlated with one another, which can be problematic for some models.
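Here is a minimal one-hot encoding sketch using pandas; the "color" column is a made-up categorical feature:

```python
# One-hot encoding a categorical column with pandas.
import pandas as pd

df = pd.DataFrame({"color": ["blue", "red", "blue", "gray"]})
# Each distinct color value becomes its own binary column.
print(pd.get_dummies(df, columns=["color"]))
```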
k-fold cross-validation is a technique used for evaluating the effectiveness of a model on unseen data. It's useful for fine-tuning the hyperparameters of a model and is usually used when the dataset is small.
This technique involves randomly partitioning the dataset into k subsets referred to as the folds. There are k rounds of training and evaluation such that, in each round, a different fold is held out for validation and a model is trained on the remaining $k-1$ folds.
The performance metrics of these k models are then averaged to assess the quality of the design choices (the algorithm used and its parameters).
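A minimal k-fold cross-validation sketch with scikit-learn follows; cv=5 is a common but arbitrary choice of k:

```python
# k-fold cross-validation with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)         # one accuracy score per fold
print(scores.mean())  # averaged to assess the design choices
```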
In a neural network, each incoming value into a node is multiplied by a numeric weight; these weighted inputs are then summed together and added to a numeric value (called the bias). In this way, the inputs are linearly transformed. Finally, the activation function is applied to this transformed value to generate the output.
If the activation function were chosen to be linear, the effect would be to compose two linear transformations, which is once again a linear transformation. In such a case, the entire neural network would have the effect of doing nothing more than performing a linear transformation of the inputs and consequently learning only the linear relationships between the features. The activation functions are, therefore, chosen to be nonlinear to learn the more complex relationships between the different features.
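To make this concrete, here is a toy sketch of a single neuron with a ReLU activation; the weights, bias, and inputs are made-up numbers:

```python
# A single neuron: linear transformation followed by a nonlinear activation.
import numpy as np

def relu(z):
    # ReLU is a common nonlinear activation: max(0, z).
    return np.maximum(0, z)

inputs = np.array([0.5, -1.2, 3.0])
weights = np.array([0.4, 0.1, -0.6])
bias = 0.2

z = np.dot(weights, inputs) + bias  # the linear transformation
output = relu(z)                    # the nonlinearity that makes stacked layers expressive
print(z, output)
```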
Reinforcement learning occurs as a learning agent (an algorithm) interacts with its environment and is rewarded or penalized for its actions; this reward-based feedback results in increased agent proficiency in interacting with and learning from its environment.
Unlike supervised learning, where the data is manually labeled input, the data in reinforcement learning is gathered through interaction with the environment. For this reason, it's a better fit for real-world scenarios where the environment is complex and dynamically changing. Reinforcement learning is commonly used in robotics, simulations, industrial automation, chatbots, and gaming.
For a regression problem, the assumption is that there's a linear relationship between the dependent variable (target) and the independent variables (features) that's expressed as:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \epsilon$

Here, the $x_i$ are the independent variables, $y$ is the target variable, and $\epsilon$ is the error term. The parameters $\beta_i$ of the linear function can be found iteratively by minimizing a loss function such as the mean squared error,

$\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$

where $y_i$ is the actual value and $\hat{y}_i$ the predicted value for the $i$-th data point.
Lasso and Ridge regression are both forms of linear regression that modify the loss function to prevent overfitting to the training data.
In Lasso regression, a regularization term (a “penalty”) is added to the loss function of OLS regression. The penalty is a scalar multiple of the sum of the absolute values of the parameters:

$\lambda \sum_{i=1}^{n} |\beta_i|$

Here, $\lambda$ is a scalar that controls the degree to which regularization affects the loss function. Having this term introduces bias and, therefore, reduces overfitting.

An artifact of this loss function is that it causes some of the parameters to be reduced to zero. A parameter $\beta_i$ that's zero implies an automatic removal of the corresponding feature $x_i$ from the model.
Note: This built-in dimension reduction may not always be desirable as the removed features may strongly correlate with the target.
On the other hand, Ridge regression uses a regularization term that is a sum of the squares of the parameters: $\lambda \sum_{i=1}^{n} \beta_i^2$. As a result, the larger parameters contribute more to the loss function for Ridge regression than they do to the loss function for Lasso regression.
Ridge regression, like Lasso regression, prevents overfitting, and it penalizes large parameters more heavily than Lasso regression does. However, the coefficients in Ridge regression never become exactly zero (the math is beyond the scope of this blog), and as a result, there's no built-in feature selection.
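The contrast is easy to see in code. Here is a hedged sketch using scikit-learn, where alpha plays the role of the regularization strength ($\lambda$ above), on synthetic data with only a few informative features:

```python
# Lasso vs. Ridge: Lasso zeros out coefficients, Ridge only shrinks them.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso tends to reduce uninformative coefficients to exactly zero;
# Ridge shrinks them toward (but not exactly to) zero.
print("Lasso coefficients at zero:", np.sum(lasso.coef_ == 0))
print("Ridge coefficients at zero:", np.sum(ridge.coef_ == 0))
```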
Both are variations of the gradient descent technique used to minimize a loss function when calibrating the weights and biases in linear regression as well as in neural networks. The difference is that in batch gradient descent, the gradient is computed over the entire dataset at each step, whereas in stochastic gradient descent, it is computed from a few randomly sampled data points at each step. The random sampling explains why it's called “stochastic”.
As a consequence, in stochastic gradient descent, each update is cheap to compute and requires little memory, but the loss decreases noisily and the path to the minimum oscillates. On the other hand, in batch gradient descent, each step follows the exact gradient of the loss over the full dataset, so the descent is smooth, but each step is expensive for large datasets. The sketch below contrasts the two.
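Here is a toy sketch contrasting the two on a one-parameter least-squares problem; the learning rates and iteration counts are made-up values:

```python
# Batch vs. stochastic gradient descent for fitting y = w * x.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 3.0 * x + rng.normal(0, 0.1, 200)  # true w = 3

# Batch: the gradient is computed over the entire dataset at each step.
w = 0.0
for _ in range(100):
    grad = np.mean(2 * (w * x - y) * x)
    w -= 0.5 * grad
print("batch:", w)

# Stochastic: each step uses one randomly sampled point, so updates are
# cheap but noisy.
w = 0.0
for _ in range(1000):
    i = rng.integers(len(x))
    grad = 2 * (w * x[i] - y[i]) * x[i]
    w -= 0.05 * grad
print("stochastic:", w)
```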
For more machine learning interview questions, check out Educative's popular Grokking the Machine Learning Interview course, as well as some other hands-on machine learning courses below.
We hope this blog was helpful. Good luck with your interview!
Free Resources