
Top machine learning interview questions for 2024

25 min read
Sep 16, 2024

When preparing for a machine learning interview, many technical and nontechnical aspects must be considered. To cover the technical aspects of a machine learning interview, we need to review the fundamentals and consider how machine learning is applied in practice.

In this blog, we provide the top 20 machine learning questions (with answers). These questions were selected to help you practice your ML skills in a real-world interview setting, and to gauge your depth of understanding of fundamental machine learning concepts.

Note: these interview questions are intended for entry-level positions. For more questions asked from a machine learning system design perspective and questions about the decision-making required for actual use cases, we recommend checking out the machine learning resources included at the end of this blog.


Top machine learning questions#

1. What’s a systematic way to solve a machine learning problem?#

  1. Understanding the problem: Understanding the nature of the problem and the desired objective is an important first step that will influence decisions like the metrics used for evaluation, the choice of algorithm and the choice of loss function.

  2. Data preparation: Data preparation requires some or all of these steps:

    • Collection and exploration: collecting data from different sources, merging it and gaining clarity in understanding it, such as identification of the dependent and independent variables.
    • Cleaning: cleaning it up by removing duplicates, removing errors, handling outliers and missing values.
    • Transforming: normalizing and scaling feature values so one feature does not dominate over the others. Discretizing continuous-valued features. Standardizing formats such as for dates and timestamps.
    • Feature engineering: selecting relevant features, reducing dimensions, pivoting, and encoding.
  3. Data splitting: Dividing the data into training, validation, and testing sets.

  4. Training the model: The machine learning technique and the algorithm are selected depending on the problem and the data. The model is trained on the training data and refined during evaluation.

  5. Evaluation: Using appropriate performance metrics and validation techniques, the model is validated on the validation data and assessed for flaws. Training is repeated until the model stops improving. The final model is then evaluated on the test dataset. If it doesn’t perform well there, the model may have overfitted to the validation data. Other cross-validation techniques can be considered, hyperparameters can be fine-tuned, and regularization can be applied at this stage. For neural networks, adding dropout or changing the number of layers may help. The overfitting may also stem from how the data was split, in which case the split can be redone and the process restarted.

  6. Deployment: The model is deployed after being assessed as successful. The deployed model is monitored continuously and evaluated. This may result in retraining or fine-tuning the model.

Solving a machine learning problem

2. What are some common clustering algorithms, and how do they work?#

Clustering algorithms are typically used for unsupervised learning (where patterns in the data are learned without labels assigned to it). These algorithms organize similar data points into groups called clusters. Some common clustering algorithms are as follows:

  1. $k$-means clustering: In $k$-means clustering, the number of clusters $k$ is chosen at the outset, and $k$ data points are randomly chosen to serve as the cluster centers. Each remaining data point is assigned to the closest cluster center. The mean of the data points in each cluster is then computed to find that cluster’s actual center. If the actual centers differ from the current ones, the entire process is repeated with these new centers. This continues until the cluster centers stop changing, or the changes become insignificant (the difference between the old and new centers of each cluster falls below a certain threshold).

  2. DBSCAN: DBSCAN identifies clusters based on a notion of density: points that lie close together are grouped together. As a result, the clusters don’t necessarily come out as circles (or hyperspheres). A group of points must contain more than a specified minimum number of points to qualify as a cluster, and points that do not fall into any cluster are treated as noise. Points with enough neighbors within a given radius are deemed core points; non-core border points may be reachable from more than one cluster.

Illustration of clusters formed with k-means and DBSCAN
  3. Hierarchical clustering: This consists of two methods called agglomerative and divisive clustering, which can be considered bottom-up and top-down versions of the same idea.

    • Agglomerative clustering begins with each data point placed in its own cluster. At each step, the two closest clusters are merged until a terminating condition is reached.
    • Divisive clustering begins with all data points placed in the same cluster. At each step, a cluster is picked and split into two until a terminating condition is reached.

    The terminating condition may be the number of clusters or the minimum size of a cluster.

    The term “hierarchical” stems from the fact that the clusters formed at each step can be organized into a hierarchical, tree-like structure. This hierarchy is visualized through a dendrogram, where the height of the horizontal lines shows the distance between the merged clusters.

    Sketching a horizontal line across the dendrogram allows us to choose the clusters based on the level of granularity required.

Hierarchical clustering dendrogram

Note: As opposed to $k$-means clustering, the number of clusters doesn't need to be specified for either DBSCAN or hierarchical clustering.
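
To make these algorithms concrete, here’s a minimal sketch, assuming scikit-learn is available; the dataset comes from `make_blobs`, and the `eps` and `min_samples` values are purely illustrative:

```python
# Sketch: running k-means, DBSCAN, and agglomerative clustering on the same toy data.
import numpy as np
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)       # label -1 marks noise points
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)  # bottom-up merging; a distance_threshold can be used instead of a fixed cluster count

print(np.unique(kmeans_labels), np.unique(dbscan_labels), np.unique(hier_labels))
```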

3. How do we choose between a model with low bias and high variance vs. another model with high bias and low variance?#

Bias and variance are measures used to indicate two different kinds of error in the model.

  • Bias: Bias represents the gap between the mean of the model’s predictions and the actual target values. A high bias indicates that the model has not learned the patterns in the data well enough to make accurate predictions; it implies underfitting of the model to the data.

  • Variance: Variance represents how much we expect the predicted value to change as we vary the dataset. A high variance indicates that the model has learned noise—patterns from the data that are not present in the real-world instances. In other words, it represents an overfitting of the model to the data.

We want models with low bias and low variance. When faced with a model that exhibits low bias and high variance vs. another that exhibits high bias and low variance, a practical approach is to use cross-validation to see which one performs better. We could also compare the mean squared error (MSE), which is a function of both measures:

$$\text{Mean Squared Error} = \text{Bias}^2 + \text{Variance}$$

The bias term is squared because bias can be negative (a square is used instead of an absolute value so that it matches the variance, which is itself the square of the standard deviation). Mean squared error is also used as a loss function to be minimized by the learning algorithm.
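
As a rough illustration of the cross-validation route, the sketch below (assuming scikit-learn) compares a flexible, high-variance-leaning model with a constrained, high-bias-leaning one on synthetic data; the specific models and dataset are stand-ins, not a prescription:

```python
# Sketch: pick between a flexible (low-bias/high-variance) model and a
# constrained (high-bias/low-variance) one using cross-validated MSE.
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

flexible = DecisionTreeRegressor(max_depth=None, random_state=0)   # tends toward high variance
constrained = LinearRegression()                                   # tends toward higher bias

for name, model in [("tree", flexible), ("linear", constrained)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(name, -scores.mean())   # lower MSE is better
```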

4. How does PCA help achieve dimensionality reduction?#

The features in a dataset can be thought of as the coordinates of a hyperspace. PCA (principal component analysis) is a technique that applies an orthogonal transformation (a combination of rotations and reflections of the axes) to the data so that the data is expressed in new coordinates. The transformation is designed to maximize the dispersion of the data along the axes (or the basis) of the transformed space. These axes are called the principal components.

A consequence of the way these principal components are constructed is that the first principal component captures more of the data’s spread (i.e., it has a larger variance along it) than the second one, which in turn has a larger variance along it than the third one, and so on. Because of this, very little information is contained along the last few principal components, and they can be dropped to achieve dimensionality reduction.

An illustration of the first and second principal components
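
Here’s a small sketch of PCA-based dimensionality reduction, assuming scikit-learn and using the Iris dataset purely as an example:

```python
# Sketch: reduce a 4-feature dataset to 2 principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                          # shape (150, 4)
X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scales

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)       # shape (150, 2)

# Fraction of total variance captured by each retained component
print(pca.explained_variance_ratio_)
```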

5. How does logistic regression compare to SVM?#

Logistic regression is appropriate where the objective is to classify data into binary categories (Yes or No). It’s effective when the class boundaries are separable by a linear function.

On the other hand, SVM (support vector machine) is better at identifying class boundaries than logistic regression. The data points that lie closest to the class boundaries are called support vectors. SVM finds a decision boundary so that the distance of the boundary from the closest support vectors on each side is maximized. With a suitable choice of a kernel (a function), the input data can also be transformed so that the decision boundaries are non-linear. For this reason, it’s a better fit for dealing with high-dimensional data.

Decision boundary for logistic regression (left) and SVM (right)
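
The sketch below, assuming scikit-learn, fits both classifiers on the same non-linearly separable data; the hyperparameters are illustrative:

```python
# Sketch: linear logistic regression vs. an RBF-kernel SVM on the same data.
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

logreg = LogisticRegression().fit(X_train, y_train)    # linear decision boundary
svm = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)   # non-linear boundary via the kernel

print("logistic regression:", logreg.score(X_test, y_test))
print("SVM (RBF kernel):   ", svm.score(X_test, y_test))
```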

6. What’s the $k$-nearest neighbors algorithm ($k$NN)?#

$k$NN (the $k$-nearest neighbors algorithm) is a supervised classification algorithm: the training data consists of labeled points. It proceeds as follows:

  • The distances of all labeled points from all unlabeled points are computed.

  • For each unlabeled point, the closest $k$ labeled data points are considered.

  • Each unlabeled point is assigned the label that appears on the majority of these $k$ nearest neighbors.

The value of $k$ must be chosen carefully, as a poor choice can lead to overfitting or underfitting.

Using 4 as the value of k and Euclidean distance as the distance metric
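
A minimal sketch of this procedure, assuming scikit-learn and using the Iris dataset as a stand-in:

```python
# Sketch: k-nearest neighbors with k=4 and Euclidean distance.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=4, metric="euclidean")
knn.fit(X_train, y_train)          # "training" essentially stores the labeled points
print(knn.score(X_test, y_test))   # each test point gets the majority label of its 4 neighbors
```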

7. What are decision trees, and how are they used for classification?#

Decision trees are primarily used for solving classification problems. Each internal node in a decision tree represents a decision or a question about a feature value. The branches coming out of an internal node represent the possible answers to that question. A leaf node corresponds to a classification category or a prediction.

To classify a data point, the tree is traversed starting at the root. At each internal node of the tree, the correct branch is taken depending on the feature value on which the decision is made. The data sample is classified once a leaf node is reached.

A binary decision tree

The choice of the feature on which each decision node splits impacts the classification’s effectiveness. Multiple algorithms exist for creating a decision tree.
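
For reference, here’s a short sketch, assuming scikit-learn, that fits a small decision tree and prints its decision rules; the dataset and depth limit are illustrative:

```python
# Sketch: fitting and inspecting a small classification tree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, criterion="gini", random_state=0).fit(X, y)

# Each internal node tests one feature value; each leaf assigns a class.
print(export_text(tree, feature_names=load_iris().feature_names))
```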

8. What is Gini impurity, and how is it used?#

In decision trees, a decision node can be considered representative of a subset of data that is to be split further based on some criteria. Gini impurity is a metric for measuring the class imbalance produced at a child node due to a decision or a split. It is given by the following formula:

$$\text{Gini impurity} = 1 - \sum_{i=1}^{n} p_i^2$$

Here, $p_i$ is the probability that a randomly selected data sample belongs to the $i^{th}$ class, and $n$ is the number of classes. The probabilities $p_i$ are calculated based on the class distribution of the data points at that node.

Gini impurity is useful for decision tree algorithms, where a feature and its value must be selected at each step of the process to create the next decision node. To do this effectively,

  • For each possible way to make a decision node, the Gini impurity of each resulting child node is calculated.
  • The weighted average of the Gini impurity of these child nodes is then calculated as a measure of the split’s quality.

The decision associated with the minimum weighted Gini is then used as the decision node.

To grasp this, it’s best to consider a small example. Suppose we’re given a problem with two classes and the following dataset:

| Color | Class |
|-------|-------|
| blue  | yes   |
| red   | yes   |
| blue  | yes   |
| blue  | no    |
| red   | yes   |
| gray  | yes   |
| gray  | no    |

If we are considering a decision node for the decision: “color = blue?”, the Gini impurity for the left child is calculated by looking at the rows where the color value is blue:

$$\text{Gini} = 1 - \big((2/3)^2 + (1/3)^2\big) = 0.44$$

For the right child, we consider the rows that have the color values red and gray:

$$\text{Gini} = 1 - \big((3/4)^2 + (1/4)^2\big) = 0.38$$

The number of instances is not the same for both children, so we take a weighted average as a measure of how good the decision is:

$$\text{Weighted Gini impurity} = \frac{3(0.44) + 4(0.38)}{7} = 0.41$$

We can similarly evaluate the decisions for the other values of the color feature. For the decisions “color = gray?” and “color = red?”, the weighted Gini impurities are 0.37 and 0.34, respectively. So we pick “color = red?” as the decision node, as it results in the split with the smallest weighted Gini impurity, i.e., the least impure split. If there were other features in the dataset, we would also have considered those in selecting the least impure split.
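
The short script below reproduces this worked example; it’s a plain-Python sketch, not a full decision tree implementation. (The exact values print as 0.40, 0.37, and 0.34; the 0.41 above comes from rounding the per-child Ginis before averaging.)

```python
# Reproducing the worked example: weighted Gini impurity for each candidate
# split on the color feature.
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

data = [("blue", "yes"), ("red", "yes"), ("blue", "yes"), ("blue", "no"),
        ("red", "yes"), ("gray", "yes"), ("gray", "no")]

for value in ("blue", "gray", "red"):
    left = [cls for color, cls in data if color == value]
    right = [cls for color, cls in data if color != value]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(data)
    print(f"color = {value}?  weighted Gini = {weighted:.2f}")
# "color = red?" gives the smallest weighted Gini impurity, i.e., the least impure split.
```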

9. What’s ensemble learning?#

Individual machine learning models may be prone to making erroneous predictions. Ensemble learning combines multiple weaker machine learning models to yield a model with improved predictive power.

There are different types of ensemble learning. For example:

  • Boosting is an ensemble learning technique where multiple weak learners are trained sequentially. The predictions of each model are validated, and the incorrectly predicted instances influence how the next model is built. Subsequent models are built and tested in this manner until a termination condition is met, such as the model’s performance failing to improve or a limit on the number of models being reached. The predictions of the intermediate models then contribute to the final prediction, based on majority votes or averages. Some well-known boosting algorithms are XGBoost, AdaBoost, and gradient boosting.

  • Bagging involves training multiple models in parallel, each on a different randomly chosen subset of the data. Random forests are a well-known example of bagging in which decision trees are used as the individual models. The predictions from these models are then aggregated: the classification outcome is the one chosen by the majority of the decision trees, and for numerical targets, the average of the predicted values is used.

Boosting (left) and bagging (right)
  • Stacking is another well-known ensemble technique in which all data is fed to multiple learners in parallel. Unlike in bagging, these learners can be implementations of different learning algorithms (SVM, neural networks, etc.). The predictions of these primary models are then used to augment the input data, which is then fed again into another learner—a meta-learner—that uses the predictions of the primary models to make the final prediction.
Stacking
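
As a sketch of how these three flavors look in practice (assuming scikit-learn; the estimators, dataset, and hyperparameters are illustrative):

```python
# Sketch: bagging, boosting, and stacking side by side.
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

models = {
    "bagging (random forest)": RandomForestClassifier(n_estimators=100, random_state=0),
    "boosting (AdaBoost)": AdaBoostClassifier(n_estimators=100, random_state=0),
    "stacking (SVM + logistic regression, with a meta-learner)": StackingClassifier(
        estimators=[("svm", SVC()), ("lr", LogisticRegression())],
        final_estimator=LogisticRegression()),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())
```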

10. What is a CNN, and how is it different from a traditional neural network?#

Convolutional neural networks (CNNs) can be considered a generalization of a traditional neural network in which additional preprocessing layers are added. They work particularly well on data where the local context is important, such as images and signals, but they are also used for natural language processing.

The main building blocks of a CNN for 2D data, such as an image, are listed below; similar ideas apply to other kinds of data, such as sequential data:

  • Input layer: For an image, each pixel (in each color channel) is an input node.
  • Convolution layer: A window, called a kernel or a filter, is slid over each image channel and an operation called convolution is applied to it (corresponding elements are multiplied, and resulting values are summed over). This is intended to reduce information loss during the subsequent compression applied in the pooling stage. Typically, multiple filters are used in a convolution layer, and these are learned through back-propagation.
Convolution applied to one window
  • Pooling layer: The convoluted image, also called a feature map, is compressed by sliding a window over it and taking (typically) the maximum or the average of the values at each step.

    Multiple convolution and pooling layers help with feature extraction and reduction of size without significant loss of information.

  • Flattening layer: The data is flattened into one dimension. This can be done, for instance, by aligning all the columns sequentially into a single column.

  • Fully connected layer: The data is fed into a traditional fully connected neural network. Each node in the output layer of the fully connected neural network represents a prediction.

The highly specialized layers that precede the fully connected layer distinguish a CNN from an ANN (artificial neural network). Because of these layers, data can be fed into a CNN with very little preprocessing. Another interesting way a CNN differs from an ANN is that sliding a small filter over the image means the same weights are shared across multiple connections, resulting in a smaller number of parameters. CNNs work better than traditional neural networks on spatial data, making them better suited for images, videos, time-series (temporal) data, and natural language.

Layers in a convolutional neural network
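
A minimal sketch of such a stack of layers, assuming PyTorch is available; the channel counts, kernel sizes, and the 28×28 single-channel input are illustrative:

```python
# Sketch: convolution -> pooling -> flatten -> fully connected, mirroring the layers above.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1),  # convolution layer (8 learned filters)
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),                                         # pooling layer (max pooling)
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),                                                        # flattening layer
    nn.Linear(16 * 7 * 7, 10),                                           # fully connected layer -> 10 class scores
)

x = torch.randn(4, 1, 28, 28)   # a batch of 4 single-channel 28x28 "images"
print(model(x).shape)           # torch.Size([4, 10])
```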

11. What do sensitivity and specificity measure?#

Sensitivity and specificity are metrics that measure the fractions of actual positives and actual negatives, respectively, that are correctly classified. Both metrics are defined as follows, using the notations TP and FP for the number of true positives and false positives, and TN and FN for the number of true negatives and false negatives:

  • Sensitivity is the fraction of actual positives that are correctly identified.

    $$\text{Sensitivity} = \frac{\text{True Positives}}{\text{Actual Positives}} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$

    In other words, sensitivity is the true-positive rate. Sensitivity is also called recall.

  • Specificity is the fraction of actual negatives that are correctly identified.

    $$\text{Specificity} = \frac{\text{True Negatives}}{\text{Actual Negatives}} = \frac{\text{TN}}{\text{TN} + \text{FP}}$$

Note that $(1 - \text{Specificity})$ is the false-positive rate:

$$1 - \text{Specificity} = 1 - \frac{\text{TN}}{\text{TN} + \text{FP}} = \frac{\text{FP}}{\text{TN} + \text{FP}} = \text{False-positive rate}$$

12. What’s a confusion matrix, and how is it useful?#

A confusion matrix has rows and columns labeled by the classification categories. The row labels represent predictions, and the column labels represent the actual classes (the convention can also be reversed).

Each cell contains the number of samples that are classified as indicated by the row label but actually belong to the class given by the column label.

Confusion matrix

In the image above, the diagonal entries represent the samples that were correctly classified. The off-diagonal entries are the errors, called type I (FP) and type II (FN) errors. The matrix entries can be of interest for many other reasons:

  • The metrics sensitivity and specificity can be computed by dividing each diagonal entry by the corresponding column sum.
  • The metric accuracy (the ratio of correct predictions to all predictions) can be found by dividing the sum of the diagonal entries by the sum of all entries.
  • When the focus is on true positives, the precision (the ratio of true-positive predictions to all positive predictions) and recall (the same as sensitivity) metrics are more relevant.
    Both are calculated easily by dividing TP by the row sum and the column sum, respectively.
Numerators are highlighted in yellow, denominators in blue boxes

In short, a confusion matrix enables us to quickly calculate different metrics and helps compare two models built against different thresholds.
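A small sketch, assuming scikit-learn and made-up binary labels, of reading these metrics off a confusion matrix:

```python
# Sketch: building a binary confusion matrix and deriving the metrics from its entries.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # made-up actual classes
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # made-up predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # recall / true-positive rate
specificity = tn / (tn + fp)
precision   = tp / (tp + fp)
accuracy    = (tp + tn) / (tp + tn + fp + fn)
print(sensitivity, specificity, precision, accuracy)
```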

13. How can the ROC curve be interpreted?#

As the classification threshold for a classification algorithm changes, the number of true positives and false positives also changes. To see which threshold works best, an ROC curve (Receiver Operating Characteristic curve) is a useful way to visualize how accurately a classifier makes positive predictions as changes are made to its classification threshold.

The ROC curve is drawn with the horizontal axis representing the false-positive rate and the vertical axis representing the true-positive rate. A point $(x, y)$ on the ROC curve represents a classification threshold that results in a false-positive rate of $x$ and a true-positive rate of $y$.

An illustrated example of an ROC curve

If the ROC curve is the 45-degree diagonal, the classification at every threshold is no better than a random coin flip. An ROC curve below this line indicates that the classifier is worse than random, and we need to reevaluate how we assign classes to the data points.

Note that the ROC curve for a classifier is monotonically increasing. In other words, increasing the threshold decreases both the true-positive and false-positive rates, and decreasing the threshold increases both.

The point on the ROC curve closest to the top-left corner is usually a good choice of threshold when the correct and incorrect classification of the positives are considered equally important. That may not be a good choice if the importance assigned to the true-positive and false-positive rates is skewed. For example, detecting true positives in cancer screening is important for early intervention. A higher false-positive rate may be stressful for healthy individuals flagged as cancer patients, but follow-up testing can help remove that stress. In such a case, we might lean toward a threshold corresponding to a higher true-positive rate at the cost of a higher false-positive rate.

14. How is AUC for an ROC curve a useful measure?#

AUC stands for the area under the curve. It can be shown that the area under the ROC curve (for a given classifier) is the probability that the classifier assigns a higher score to a random instance from the positive class than to a random instance from the negative class. In other words, it measures how well the classifier distinguishes positive class instances from negative ones.

Since this numeric value does not depend on a single threshold and gives information about the classifier’s overall quality, it is a good measure for comparing classifiers and far more useful than dealing with multiple confusion matrices against each threshold.

A classifier with AUC 0.5 performs no better than random predictions
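
A short sketch, assuming scikit-learn, of computing the ROC points and the AUC from a classifier’s predicted scores; the classifier and dataset are illustrative:

```python
# Sketch: tracing an ROC curve and computing its AUC from predicted probabilities.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = LogisticRegression().fit(X_train, y_train).predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, scores)   # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_test, scores))       # threshold-free summary of the curve
```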

15. What are the advantages and disadvantages of one hot encoding?#

One hot encoding transforms categorical (textual) features into numerical ones. This is done by making each feature value a distinct feature with a binary 0 or 1 value. Here are the pros:

  • Algorithms that implement regression or use neural networks deal with numbers. One hot encoding is useful for such algorithms that require numeric data.
  • As the new feature values are 0 or 1, no implicit rank is assigned to the values of the corresponding feature in the original dataset. For example, if we had encoded past, present, and future in the following illustration as 1, 2, and 3, it might have implied that one of these is more important than the others.
Applying one hot encoding to the period column

The encoding also has some cons:

  • The data becomes sparse: in each row, of the new features created from a single categorical feature, exactly one is 1 and the rest are 0.
  • The dimensionality of the data increases.
  • The new features created from the same categorical feature are correlated with one another.
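
A minimal sketch with pandas, using a hypothetical `period` column like the one in the illustration above:

```python
# Sketch: one hot encoding a categorical "period" column.
import pandas as pd

df = pd.DataFrame({"period": ["past", "present", "future", "present"]})
encoded = pd.get_dummies(df, columns=["period"], dtype=int)
print(encoded)
# Each value becomes its own 0/1 column (period_past, period_present, period_future),
# so no ordering is implied, at the cost of more, sparser, correlated columns.
```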

16. What’s $k$-fold cross-validation, and how is it useful?#

$k$-fold cross-validation is a technique used for evaluating the effectiveness of a model on unseen data. It’s useful for fine-tuning the hyperparameters of a model and is usually used when the dataset is small.

This technique involves randomly partitioning the dataset into $k$ subsets referred to as the $k$ folds. There are $k$ rounds such that:

  • In each round, a new model (with the same parameters) is trained using $k-1$ of the folds as the training data and the remaining fold as the testing data.
  • Some performance metric is used to evaluate this model.

The performance metrics from these $k$ models are averaged to assess the quality of the design choices (the algorithm used and its parameters).
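
A minimal sketch of 5-fold cross-validation, assuming scikit-learn; the model and dataset are illustrative:

```python
# Sketch: 5-fold cross-validation — five models, each tested on a held-out fold,
# with the scores averaged.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)          # one score per fold
print(scores.mean())   # averaged to assess the overall design choice
```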

17. Why are the activation functions used in a neural network nonlinear?#

In a neural network, each incoming value into a node is multiplied by a numeric weight; these weighted inputs are then summed together and added to a numeric value (called the bias). In this way, the inputs are linearly transformed. Finally, the activation function is applied to this transformed value to generate the output.

An activation function f applied to linear transformations

If the activation function were chosen to be linear, the effect would be to compose two linear transformations, which is once again a linear transformation. In such a case, the entire neural network would have the effect of doing nothing more than performing a linear transformation of the inputs and consequently learning only the linear relationships between the features. The activation functions are, therefore, chosen to be nonlinear to learn the more complex relationships between the different features.
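
The tiny NumPy sketch below illustrates this: two layers with linear activations collapse into a single linear transformation, while a ReLU breaks the equivalence. The weights are random and purely illustrative:

```python
# Sketch: without a nonlinearity, stacking layers equals one linear transformation.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=3)
W2, b2 = rng.normal(size=(3, 2)), rng.normal(size=2)
x = rng.normal(size=4)

# Two "linear-activation" layers...
two_layers = (x @ W1 + b1) @ W2 + b2
# ...equal one linear layer with combined weights and bias.
W, b = W1 @ W2, b1 @ W2 + b2
print(np.allclose(two_layers, x @ W + b))   # True

# A nonlinearity such as ReLU breaks this equivalence:
relu = lambda z: np.maximum(z, 0)
print(np.allclose(relu(x @ W1 + b1) @ W2 + b2, x @ W + b))  # generally False
```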

18. Where is reinforcement learning used?#

Reinforcement learning occurs as a learning agent (an algorithm) interacts with its environment and is rewarded or penalized for its actions; this reward-based feedback results in increased agent proficiency in interacting with and learning from its environment.

Unlike supervised learning, where the data is manually labeled input, the data in reinforcement learning is gathered through interaction with the environment. For this reason, it’s a better fit for real-world scenarios where the environment is complex and dynamically changing. Reinforcement learning is commonly used in robotics, simulations, industrial automation processes, chatbots, and gaming.

19. What’s the difference between Lasso and Ridge regression?#

For a regression problem, the assumption is that there’s a linear relationship between the dependent variable (target) and the independent variables (features) that’s expressed as:

$$y = \sum_i \alpha_i x_i + \epsilon$$

Here, the $x_i$ are the independent variables, $y$ is the target variable, and $\epsilon$ is the error term. The parameters $\alpha_i$ of the linear function can be found iteratively by minimizing a loss function:

$$\text{Loss function} = \sum_k (y_k - \hat{y}_k)^2$$

Here, $y_k$ is the actual value and $\hat{y}_k$ is the predicted value for the $k^{th}$ data point.

Lasso and Ridge regression are both forms of regression that use a different loss function to prevent overfitting to the training data.

  • In Lasso regression, a regularization term (a “penalty”) is added to the loss function of ordinary least squares (OLS) regression. The penalty is a scalar multiple of the sum of the absolute values of the parameters:

    $$\text{Loss function} + \lambda \sum_i |\alpha_i|$$

    Here, $\lambda$ is a scalar that controls the degree to which regularization affects the loss function. Having this term introduces bias and, therefore, reduces overfitting.

    An artifact of this loss function is that it drives some of the parameters $\alpha_i$ to zero. A parameter $\alpha_i$ that’s zero implies an automatic removal of the corresponding feature $x_i$ from the model.

    Note: This built-in dimension reduction may not always be desirable as the removed features may strongly correlate with the target.

  • On the other hand, Ridge regression uses a regularized term that is a sum of the squares of the parameters. As a result, the larger parameters contribute more to the loss function for Ridge regression than to the loss function for Lasso regression.

    $$\text{Loss function} + \lambda \sum_i \alpha_i^2$$

    Ridge regression, like Lasso regression, prevents overfitting. It penalizes large-valued parameters more heavily than Lasso regression. However, the coefficients of Ridge regression never become exactly zero (the math is beyond the scope of this blog), and as a result, there’s no feature selection.
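
A short sketch, assuming scikit-learn, showing the difference in behavior; the `alpha` value and dataset are illustrative:

```python
# Sketch: with the same penalty strength, Lasso zeroes out some coefficients
# while Ridge only shrinks them.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))   # typically > 0 (feature selection)
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))   # typically 0 (shrinkage only)
```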

20. What’s the difference between batch gradient descent and stochastic gradient descent?#

Both are variations of the gradient descent technique used to minimize a loss function when calibrating the weights and biases in linear regression as well as in neural networks. The difference is that batch gradient descent computes each update step using the entire dataset, whereas stochastic gradient descent computes each step using one (or a few) randomly sampled data points. The random sampling explains why it’s called “stochastic.”

As a consequence, in stochastic gradient descent:

  • Each step is efficient as it deals with a small data sample.
  • The fluctuations in the weights and biases at each step lead to a less stable convergence.
  • There’s a greater chance of escaping a local minimum when stuck in one.

On the other hand, in batch gradient descent:

  • The process can be time-consuming.
  • The progress and convergence are both steady.
  • The algorithm can get stuck in a local minimum (when the objective function is non-convex).
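
A bare-bones NumPy sketch of the two update styles on a one-parameter least-squares problem; the learning rate and step counts are illustrative:

```python
# Sketch: full-batch vs. stochastic updates for fitting y ≈ w * x.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(scale=0.5, size=200)   # true slope = 3

def grad(w, xb, yb):
    # Gradient of the mean squared error for the model y_hat = w * x
    return -2 * np.mean(xb * (yb - w * xb))

w_batch, w_sgd, lr = 0.0, 0.0, 0.1
for _ in range(50):
    w_batch -= lr * grad(w_batch, x, y)            # one step uses the entire dataset
    i = rng.integers(len(x))
    w_sgd -= lr * grad(w_sgd, x[i:i+1], y[i:i+1])  # one step uses a single random sample

print(w_batch, w_sgd)   # both approach 3; the stochastic estimate fluctuates more
```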

More machine learning questions and resources#

For more machine learning interview questions, check out Educative’s popular Grokking the Machine Learning Interview course, as well as some other hands-on machine learning courses below.

We hope this blog was helpful. Good luck with your interview!


 