Estimation is the process of making an educated guess about a population parameter based on a sample from that population. This step is crucial because collecting data from the entire population is often impractical, so we draw inferences from a sample instead.
In machine learning, we use a model to represent real-world phenomena. These models are governed by parameters that influence our predictions, and the reliability of those predictions is only as good as the parameter values that govern the model.
This blog will discuss the importance of parameter estimation in machine learning. We will specifically look into maximum likelihood estimation and its practical implementation in Python for estimating the parameters of a machine learning model, using logistic regression as our example.
Maximum likelihood estimation (MLE) is a statistical approach for determining a model’s parameters in machine learning. The idea is to find the values of the model parameters under which the observed data is most probable, that is, the values that maximize the likelihood of the observed data.
Let’s look at an example to understand MLE better. Assume that we want to estimate the average height of a city’s population. However, because of the sheer size of the population, we cannot calculate the true average height of the population. So, we estimate the average height as follows:
Defining a statistical model: We start by assuming that the heights in the population follow a normal distribution. This implies that most people are close to the average height, and progressively fewer people are much shorter or much taller than that average.
Collecting the sample: We then collect a sample of heights from the population and find the average height based on that sample.
Calculating the likelihood function: We then ask how likely the observed heights are under a given choice of the model’s parameters. The likelihood function represents the probability of observing the collected data given the parameters of our model; in our case, the model’s parameters are the normal distribution’s mean and standard deviation. For computational reasons, the log-likelihood function is often used instead of the likelihood function.
Maximizing the likelihood function: MLE finds the parameter values, in particular the average height, that maximize the log-likelihood of obtaining the observed sample, that is, the values under which the observed heights are most probable.
We can now describe the population’s heights with a normal distribution whose parameters are selected by maximizing the likelihood function, and read off the estimated average height as the fitted mean.
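To make these steps concrete, here is a minimal sketch of MLE for a normal distribution in Python. The sample of heights, the initial guesses, and the helper name are made-up illustrative choices; the sketch simply minimizes the negative log-likelihood numerically with SciPy:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Hypothetical sample of heights in centimeters (illustrative values only)
heights = np.array([162.0, 171.5, 168.3, 175.2, 158.9, 180.1, 169.7, 173.4])

# Negative log-likelihood of the sample under a normal distribution
def height_neg_log_likelihood(params, data):
    mean, std = params
    return -np.sum(norm.logpdf(data, loc=mean, scale=std))

# Maximize the likelihood by minimizing its negative (initial guess: mean 170 cm, std 10 cm)
result = minimize(height_neg_log_likelihood, x0=[170.0, 10.0], args=(heights,), method='Nelder-Mead')
mle_mean, mle_std = result.x

print("MLE mean:", mle_mean)  # Close to heights.mean()
print("MLE std:", mle_std)    # Close to heights.std(), the biased (maximum likelihood) estimate
```

For the normal distribution, this maximization also has a closed-form solution: the MLE of the mean is the sample mean, and the MLE of the standard deviation is the (biased) sample standard deviation.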
In supervised machine learning, we use labeled data to train the model’s parameters. The training data consists of input features and the corresponding output labels. During the training phase, we aim to find the model parameters that best capture the patterns in the labeled data.
MLE helps fine-tune machine learning models. In the training phase, we adjust the model’s parameters to maximize the likelihood of the labeled data. Equivalently, we can use the negative log-likelihood as the loss function. A loss function quantifies the difference between predicted and actual values and can be written as follows:

$$L(y, \hat{y}) = -\log P(y \mid \hat{y})$$

Here, $y$ represents the actual output and $\hat{y}$ represents the estimated value. We aim to minimize this loss function $L$ during training to reach an accurate and effective model. Note that minimizing the negative log-likelihood is equivalent to maximizing the likelihood, which is a common objective when training probabilistic models.
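As a tiny numerical illustration (the probabilities below are made up), the candidate parameter setting with the higher likelihood is exactly the one with the lower negative log-likelihood:

```python
import numpy as np

# Hypothetical probabilities that two candidate parameter settings assign to the observed labels
probs_candidate_a = np.array([0.9, 0.8, 0.7])
probs_candidate_b = np.array([0.6, 0.5, 0.4])

for name, probs in [("A", probs_candidate_a), ("B", probs_candidate_b)]:
    likelihood = np.prod(probs)                  # Product of per-sample probabilities
    neg_log_likelihood = -np.sum(np.log(probs))  # Negative log-likelihood (the loss)
    print(name, "likelihood:", likelihood, "negative log-likelihood:", neg_log_likelihood)

# Candidate A: likelihood 0.504, negative log-likelihood ~0.69
# Candidate B: likelihood 0.120, negative log-likelihood ~2.12
```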
Let’s now work through a hands-on example where we use MLE to estimate the parameters of a machine learning model that maximize the likelihood of the observed data. We’ll consider a binary classification problem with two possible classes. Logistic regression is a commonly used algorithm for binary classification. The model’s parameters in logistic regression include the coefficients associated with each feature and an intercept term. The intercept term is an additional parameter that represents the log-odds of the event being true when all the features are zero.
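For a single feature, one common way to write this model (the symbols $\theta_0$ and $\theta_1$ below are just illustrative notation for the intercept and the feature coefficient) is:

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x)}}, \qquad \log\frac{P(y = 1 \mid x)}{P(y = 0 \mid x)} = \theta_0 + \theta_1 x$$

When $x = 0$, the log-odds reduce to the intercept $\theta_0$, which is exactly the interpretation given above.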
We’ll start by generating a synthetic dataset for a binary classification problem with a single feature as follows:
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate synthetic dataset for binary classification
X, y = make_classification(n_samples=100, n_features=1, n_informative=1,
                           n_redundant=0, n_clusters_per_class=1)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)
```
Lines 1–2: We import the make_classification and train_test_split methods.
Lines 5–6: We create a synthetic dataset with the feature X and the target y using the make_classification function. The dataset consists of 100 samples (n_samples=100) with a single feature (n_features=1) that is informative (n_informative=1), resulting in a binary classification problem.
Lines 9–11: We split X and y into training and test datasets using the train_test_split method. We use the training dataset to train the model and the test dataset to evaluate the model’s performance.
We will now calculate the negative log-likelihood for logistic regression. The likelihood function models the probability of observing the given binary outcomes (0 or 1) given the input features. We will calculate the negative log-likelihood during optimization and minimize this function, which is equivalent to maximizing the likelihood.
Mathematically, the negative log-likelihood in logistic regression can be written as follows:

$$-\ell(\theta) = -\sum_{i=1}^{n}\left[\, y_i\, x_i^\top \theta - \log\left(1 + e^{x_i^\top \theta}\right) \right]$$

Let’s break down the equation to understand it better:
$y_i$ is the observed binary label (0 or 1) of the $i$-th sample.
$x_i$ is the feature vector of the $i$-th sample, including a constant 1 for the intercept.
$\theta$ is the vector of model parameters (the intercept and the feature coefficients).
$x_i^\top \theta$ is the linear combination of the features and the parameters, denoted z in the code below.
$n$ is the number of training samples.
Let’s see how we can implement this in Python:
```python
import numpy as np
# The following imports are also used in the later code blocks
from scipy.optimize import minimize
from scipy.stats import logistic
import matplotlib.pyplot as plt

# Define the negative log-likelihood function for logistic regression
def neg_log_likelihood(theta, X, y):
    z = np.dot(X, theta)  # Linear combination of features and parameters
    n_log_likelihood = -np.sum(y * z - np.log(1 + np.exp(z)))
    return n_log_likelihood
```
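As a quick sanity check (this call is not part of the original walkthrough, and the values are arbitrary), we can evaluate the function on a tiny hand-made design matrix:

```python
# Two hypothetical samples: an intercept column of ones plus one feature value each
X_demo = np.array([[1.0, 0.5],
                   [1.0, -1.2]])
y_demo = np.array([1, 0])
theta_demo = np.zeros(2)  # With theta = 0, every sample gets probability 0.5

# Expected output: 2 * log(2) ≈ 1.386
print(neg_log_likelihood(theta_demo, X_demo, y_demo))
```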
Now, we need to find the parameters that minimize the negative log-likelihood as follows:
```python
# Use MLE to estimate logistic regression parameters
initial_theta = np.zeros(X_train.shape[1] + 1)  # Initial guess for coefficients (including intercept)
X_train_with_intercept = np.c_[np.ones(X_train.shape[0]), X_train]  # Add intercept to features

result = minimize(neg_log_likelihood, initial_theta, args=(X_train_with_intercept, y_train), method='BFGS')
estimated_theta = result.x

# Plot the logistic regression curve
x_values = np.linspace(min(X_train), max(X_train), 100).reshape(-1, 1)
x_values_with_intercept = np.c_[np.ones(x_values.shape[0]), x_values]
predicted_probabilities = logistic.cdf(np.dot(x_values_with_intercept, estimated_theta))

plt.scatter(X, y, color='blue', marker='o', label='dataset')
plt.plot(x_values, predicted_probabilities, color='green', linewidth=2,
         label='Logistic Regression')
plt.xlabel('Feature')
plt.ylabel('Class')
plt.title('Logistic Regression with MLE')
plt.legend()
plt.savefig('output/graph.png')
```
Lines 2–3: We initialize the logistic regression parameters (theta) to zero, with the length set to the number of features in the training data plus 1 to include the intercept term, and we add a column of ones to the training features for that intercept. Note that we only have one feature here, so the length of the initial_theta vector will be 2.
Line 5: We minimize the negative log-likelihood function using the minimize function.
Line 6: We store the estimated values of theta in the estimated_theta variable.
Line 11: We use the logistic cumulative distribution function (CDF) to calculate the predicted probabilities. The dot product of x_values_with_intercept and estimated_theta represents the linear combination of the parameters and features.
Lines 13–20: We plot the scatter plot of the data points and the logistic regression curve obtained through MLE.
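As an optional sanity check (not part of the original walkthrough), we can compare our MLE estimates with the coefficients found by scikit-learn’s LogisticRegression. With regularization switched off, it fits the same model by maximizing the likelihood, so the two estimates should be close:

```python
from sklearn.linear_model import LogisticRegression

# Fit the same model with scikit-learn, without regularization
sk_model = LogisticRegression(penalty=None)  # use penalty='none' on scikit-learn versions before 1.2
sk_model.fit(X_train, y_train)

print("MLE estimate (intercept, coefficient):", estimated_theta)
print("scikit-learn estimate:", sk_model.intercept_, sk_model.coef_.ravel())
```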
Finally, we can now evaluate our model by calculating the accuracy of our prediction on the test set as follows:
```python
# Evaluate the model on the test set
X_test_with_intercept = np.c_[np.ones(X_test.shape[0]), X_test]
predicted_probabilities_test = logistic.cdf(np.dot(X_test_with_intercept, estimated_theta))
predicted_labels_test = (predicted_probabilities_test >= 0.5).astype(int)

# Print accuracy on the test set
accuracy = np.mean(predicted_labels_test == y_test)
print("Accuracy on the test set:", accuracy)
```
Line 4: We set a threshold of 0.5 to obtain binary predictions from the predicted probabilities.
Line 7: We calculate the accuracy of the predicted labels by comparing them with the actual labels.
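Equivalently (this is an optional aside, not part of the original code), the same metric can be computed with scikit-learn’s accuracy_score helper:

```python
from sklearn.metrics import accuracy_score

# Same accuracy computation using scikit-learn's helper function
print("Accuracy on the test set:", accuracy_score(y_test, predicted_labels_test))
```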
We looked at how MLE is used to determine model parameters in machine learning through a hands-on Python implementation. If you want to learn about machine learning models and how they are implemented in Python, we encourage you to explore the following courses on Educative:
Feature Engineering for Machine Learning
Feature engineering is a crucial stage in any machine learning project. It allows you to use data to define features that enable machine learning algorithms to work properly. In this course, you will learn techniques that help you create new features from existing features. You’ll start by diving into label encoding, which is crucial for converting categorical features into numerical ones. You’ll also learn about various other types of encoding, such as one-hot, count, and mean encoding, all of which are important for feature engineering. In the remaining chapters, you’ll learn about feature interaction and datetime features. In all, this course will show you the many different ways you can create features from existing ones.
Bayesian Machine Learning for Optimization in Python
Bayesian optimization allows developers to leverage Bayesian inference and statistical modeling to efficiently search for the optimal solution in a high-dimensional space. Starting with the fundamentals of statistics and Bayesian statistics, you’ll explore different concepts of machine learning and its applications in software engineering. Next, you’ll discover different strategies for optimizations. Through practical examples and hands-on exercises, you’ll gain proficiency in implementing Bayesian optimization algorithms and fine-tuning them for specific tasks. By the end of the course, you’ll have a comprehensive understanding of the entire Bayesian optimization workflow, from problem formulation to solution optimization. By completing this course, you’ll be able to tackle complex optimization problems more efficiently and effectively. You’ll be equipped to find optimal solutions in areas such as hyperparameter tuning, experimental design, algorithm configuration, and system optimization.
A Practical Guide to Machine Learning with Python
This course teaches you how to code basic machine learning models. The content is designed for beginners with general knowledge of machine learning, including common algorithms such as linear regression, logistic regression, SVM, KNN, decision trees, and more. If you need a refresher, we have summarized key concepts from machine learning, and there are overviews of specific algorithms dispersed throughout the course.