Estimation is the process of making an educated guess about a population parameter based on a sample from that population. This step is crucial because collecting data from the entire population is often impractical, so we draw inferences from a sample instead.
In machine learning, we use a model to represent real-world phenomena. These models are governed by parameters that influence our predictions, and the reliability of those predictions is only as good as the parameter values that govern the model.
This blog will discuss the importance of parameter estimation in machine learning. We will specifically look into maximum likelihood estimation and its practical implementation in Python for estimating the parameters of a machine learning model, using logistic regression as our example.
Maximum likelihood estimation (MLE) is a statistical approach for determining a model’s parameters in machine learning. The idea is to find the values of the model parameters under which the observed data is most probable, that is, the values that maximize the likelihood of the observed data.
Let’s look at an example to understand MLE better. Assume that we want to estimate the average height of a city’s population. However, because of the sheer size of the population, we cannot calculate the true average height of the population. So, we estimate the average height as follows:
Defining a statistical model: We start by assuming that the heights in the population follow a normal distribution. This implies that most people are close to the average height, and progressively fewer people are much shorter or much taller than that average.
Collecting the sample: We then collect a sample of heights from the population and find the average height based on that sample.
Calculating the likelihood function: We then ask how likely the observed heights are under a given choice of the model’s parameters. The likelihood function represents the probability of observing the collected data given the parameters of our model; in our case, the model’s parameters are the normal distribution’s mean and standard deviation. For computational reasons, the log-likelihood function is often used instead of the likelihood function.
Maximizing the likelihood function: MLE finds the parameter values, in particular the average height, that maximize the log-likelihood of obtaining the observed sample, that is, the values under which the observed heights are most probable.
We can now describe the population’s heights with a normal distribution whose parameters are selected by maximizing the likelihood function, and read off the estimated average height as the fitted mean.
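To make these steps concrete, here is a minimal sketch of MLE for a normal distribution in Python. The sample of heights, the initial guesses, and the helper name are made-up illustrative choices; the sketch simply minimizes the negative log-likelihood numerically with SciPy:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Hypothetical sample of heights in centimeters (illustrative values only)
heights = np.array([162.0, 171.5, 168.3, 175.2, 158.9, 180.1, 169.7, 173.4])

# Negative log-likelihood of the sample under a normal distribution
def height_neg_log_likelihood(params, data):
    mean, std = params
    return -np.sum(norm.logpdf(data, loc=mean, scale=std))

# Maximize the likelihood by minimizing its negative (initial guess: mean 170 cm, std 10 cm)
result = minimize(height_neg_log_likelihood, x0=[170.0, 10.0], args=(heights,), method='Nelder-Mead')
mle_mean, mle_std = result.x

print("MLE mean:", mle_mean)  # Close to heights.mean()
print("MLE std:", mle_std)    # Close to heights.std(), the biased (maximum likelihood) estimate
```

For the normal distribution, this maximization also has a closed-form solution: the MLE of the mean is the sample mean, and the MLE of the standard deviation is the (biased) sample standard deviation.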
In supervised machine learning, we use labeled data to train the model’s parameters. The training data consists of input features and the corresponding output labels. During the training phase, we aim to find the model parameters that best capture the patterns in the labeled data.
MLE helps fine-tune machine learning models. In the training phase, we adjust the model’s parameters to maximize the likelihood of the labeled data. Equivalently, we can use the negative log-likelihood as the loss function. A loss function quantifies the difference between predicted and actual values and can be written as follows:

$$L(y, \hat{y}) = -\log P(y \mid \hat{y})$$

Here, $y$ represents the actual output and $\hat{y}$ represents the estimated value. We aim to minimize this loss function $L$ during training to reach an accurate and effective model. Note that minimizing the negative log-likelihood is equivalent to maximizing the likelihood, which is a common objective when training probabilistic models.
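As a tiny numerical illustration (the probabilities below are made up), the candidate parameter setting with the higher likelihood is exactly the one with the lower negative log-likelihood:

```python
import numpy as np

# Hypothetical probabilities that two candidate parameter settings assign to the observed labels
probs_candidate_a = np.array([0.9, 0.8, 0.7])
probs_candidate_b = np.array([0.6, 0.5, 0.4])

for name, probs in [("A", probs_candidate_a), ("B", probs_candidate_b)]:
    likelihood = np.prod(probs)                  # Product of per-sample probabilities
    neg_log_likelihood = -np.sum(np.log(probs))  # Negative log-likelihood (the loss)
    print(name, "likelihood:", likelihood, "negative log-likelihood:", neg_log_likelihood)

# Candidate A: likelihood 0.504, negative log-likelihood ~0.69
# Candidate B: likelihood 0.120, negative log-likelihood ~2.12
```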
Let’s now work through a hands-on example where we use MLE to estimate the parameters of a machine learning model that maximize the likelihood of the observed data. We’ll consider a binary classification problem with two possible classes. Logistic regression is a commonly used algorithm for binary classification. The model’s parameters in logistic regression include the coefficients associated with each feature and an intercept term. The intercept term is an additional parameter that represents the log-odds of the event being true when all the features are zero.
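For a single feature, one common way to write this model (the symbols $\theta_0$ and $\theta_1$ below are just illustrative notation for the intercept and the feature coefficient) is:

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x)}}, \qquad \log\frac{P(y = 1 \mid x)}{P(y = 0 \mid x)} = \theta_0 + \theta_1 x$$

When $x = 0$, the log-odds reduce to the intercept $\theta_0$, which is exactly the interpretation given above.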
We’ll start by generating a synthetic dataset for a binary classification problem with a single feature as follows:
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate synthetic dataset for binary classification
X, y = make_classification(n_samples=100, n_features=1, n_informative=1,
                           n_redundant=0, n_clusters_per_class=1)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)
```
Lines 1–2: We import the make_classification and train_test_split methods.
Lines 5–6: We create a synthetic dataset with the feature X and the target y using the make_classification function. The dataset consists of 100 samples (n_samples=100) with a single feature (n_features=1) that is informative (n_informative=1), resulting in a binary classification problem.
Lines 9–11: We split X and y into training and test datasets using the train_test_split method. We use the training dataset to train the model and the test dataset to evaluate the model’s performance.
We will now calculate the negative log-likelihood for logistic regression. The likelihood function models the probability of observing the given binary outcomes (0 or 1) given the input features. We will calculate the negative log-likelihood during optimization and minimize this function, which is equivalent to maximizing the likelihood.
Mathematically, the negative log-likelihood in logistic regression can be written as follows:

$$-\ell(\theta) = -\sum_{i=1}^{n}\left[\, y_i\, x_i^\top \theta - \log\left(1 + e^{x_i^\top \theta}\right) \right]$$

Let’s break down the equation to understand it better:
$y_i$ is the observed binary label (0 or 1) of the $i$-th sample.
$x_i$ is the feature vector of the $i$-th sample, including a constant 1 for the intercept.
$\theta$ is the vector of model parameters (the intercept and the feature coefficients).
$x_i^\top \theta$ is the linear combination of the features and the parameters, denoted z in the code below.
$n$ is the number of training samples.
Let’s see how we can implement this in Python:
```python
import numpy as np
# The following imports are also used in the later code blocks
from scipy.optimize import minimize
from scipy.stats import logistic
import matplotlib.pyplot as plt

# Define the negative log-likelihood function for logistic regression
def neg_log_likelihood(theta, X, y):
    z = np.dot(X, theta)  # Linear combination of features and parameters
    n_log_likelihood = -np.sum(y * z - np.log(1 + np.exp(z)))
    return n_log_likelihood
```
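As a quick sanity check (this call is not part of the original walkthrough, and the values are arbitrary), we can evaluate the function on a tiny hand-made design matrix:

```python
# Two hypothetical samples: an intercept column of ones plus one feature value each
X_demo = np.array([[1.0, 0.5],
                   [1.0, -1.2]])
y_demo = np.array([1, 0])
theta_demo = np.zeros(2)  # With theta = 0, every sample gets probability 0.5

# Expected output: 2 * log(2) ≈ 1.386
print(neg_log_likelihood(theta_demo, X_demo, y_demo))
```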
Now, we need to find the parameters that minimize the negative log-likelihood as follows:
```python
# Use MLE to estimate logistic regression parameters
initial_theta = np.zeros(X_train.shape[1] + 1)  # Initial guess for coefficients (including intercept)
X_train_with_intercept = np.c_[np.ones(X_train.shape[0]), X_train]  # Add intercept to features

result = minimize(neg_log_likelihood, initial_theta, args=(X_train_with_intercept, y_train), method='BFGS')
estimated_theta = result.x

# Plot the logistic regression curve
x_values = np.linspace(min(X_train), max(X_train), 100).reshape(-1, 1)
x_values_with_intercept = np.c_[np.ones(x_values.shape[0]), x_values]
predicted_probabilities = logistic.cdf(np.dot(x_values_with_intercept, estimated_theta))

plt.scatter(X, y, color='blue', marker='o', label='dataset')
plt.plot(x_values, predicted_probabilities, color='green', linewidth=2,
         label='Logistic Regression')
plt.xlabel('Feature')
plt.ylabel('Class')
plt.title('Logistic Regression with MLE')
plt.legend()
plt.savefig('output/graph.png')
```
Lines 2–3: We initialize the logistic regression parameters (theta) to zero, with the length set to the number of features in the training data plus 1 to include the intercept term, and we add a column of ones to the training features for that intercept. Note that we only have one feature here, so the length of the initial_theta vector will be 2.
Line 5: We minimize the negative log-likelihood function using the minimize function.
Line 6: We store the estimated values of theta in the estimated_theta variable.
Line 11: We use the logistic cumulative distribution function (CDF) to calculate the predicted probabilities. The dot product of x_values_with_intercept and estimated_theta represents the linear combination of the parameters and features.
Lines 13–20: We plot the scatter plot of the data points and the logistic regression curve obtained through MLE.
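As an optional sanity check (not part of the original walkthrough), we can compare our MLE estimates with the coefficients found by scikit-learn’s LogisticRegression. With regularization switched off, it fits the same model by maximizing the likelihood, so the two estimates should be close:

```python
from sklearn.linear_model import LogisticRegression

# Fit the same model with scikit-learn, without regularization
sk_model = LogisticRegression(penalty=None)  # use penalty='none' on scikit-learn versions before 1.2
sk_model.fit(X_train, y_train)

print("MLE estimate (intercept, coefficient):", estimated_theta)
print("scikit-learn estimate:", sk_model.intercept_, sk_model.coef_.ravel())
```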
Finally, we can now evaluate our model by calculating the accuracy of our prediction on the test set as follows:
```python
# Evaluate the model on the test set
X_test_with_intercept = np.c_[np.ones(X_test.shape[0]), X_test]
predicted_probabilities_test = logistic.cdf(np.dot(X_test_with_intercept, estimated_theta))
predicted_labels_test = (predicted_probabilities_test >= 0.5).astype(int)

# Print accuracy on the test set
accuracy = np.mean(predicted_labels_test == y_test)
print("Accuracy on the test set:", accuracy)
```
Line 4: We set a threshold of 0.5 to obtain binary predictions from the predicted probabilities.
Line 7: We calculate the accuracy of the predicted labels by comparing them with the actual labels.
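Equivalently (this is an optional aside, not part of the original code), the same metric can be computed with scikit-learn’s accuracy_score helper:

```python
from sklearn.metrics import accuracy_score

# Same accuracy computation using scikit-learn's helper function
print("Accuracy on the test set:", accuracy_score(y_test, predicted_labels_test))
```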
We looked at how MLE is used to determine model parameters in machine learning through a hands-on Python implementation. If you want to learn about machine learning models and how they are implemented in Python, we encourage you to explore the following courses on Educative:
Feature Engineering for Machine Learning
Feature engineering is a crucial stage in any machine learning project. It allows you to use data to define features that enable machine learning algorithms to work properly. In this course, you will learn techniques that help you create new features from existing features. You’ll start by diving into label encoding, which is crucial for converting categorical features into numerical ones. You’ll also learn about various other types of encoding, such as one-hot, count, and mean encoding, all of which are important for feature engineering. In the remaining chapters, you’ll learn about feature interaction and datetime features. In all, this course will show you the many different ways you can create features from existing ones.
Bayesian Machine Learning for Optimization in Python
Bayesian optimization allows developers to leverage Bayesian inference and statistical modeling to efficiently search for the optimal solution in a high-dimensional space. Starting with the fundamentals of statistics and Bayesian statistics, you’ll explore different concepts of machine learning and its applications in software engineering. Next, you’ll discover different strategies for optimizations. Through practical examples and hands-on exercises, you’ll gain proficiency in implementing Bayesian optimization algorithms and fine-tuning them for specific tasks. By the end of the course, you’ll have a comprehensive understanding of the entire Bayesian optimization workflow, from problem formulation to solution optimization. By completing this course, you’ll be able to tackle complex optimization problems more efficiently and effectively. You’ll be equipped to find optimal solutions in areas such as hyperparameter tuning, experimental design, algorithm configuration, and system optimization.
A Practical Guide to Machine Learning with Python
This course teaches you how to code basic machine learning models. The content is designed for beginners with general knowledge of machine learning, including common algorithms such as linear regression, logistic regression, SVM, KNN, decision trees, and more. If you need a refresher, we have summarized key concepts from machine learning, and there are overviews of specific algorithms dispersed throughout the course.