How to tune the hyperparameters

Overview

Learning to choose the right hyperparameters is one of the best ways to get the most out of our machine learning and deep learning models. In this article, we’ll explore five of them:

  • No. of epochs
  • No. of hidden layers
  • Learning rate
  • Loss function
  • Activation function

Tuning hyperparameters is an integral part of deep learning. Understanding the tuning process before we build any models allows us to extract the maximum performance from them and gives us leverage in building top-performing models.
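
Before we dig in, here is a minimal Keras sketch that marks where each of these five hyperparameters shows up. The layer sizes, shapes, and values below are placeholders, not recommendations.

# A minimal sketch of where each of the five hyperparameters appears in a
# Keras model. Layer sizes, shapes, and values are placeholders.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(16, activation='relu'),    # hidden layer + activation function
    tf.keras.layers.Dense(16, activation='relu'),    # number of hidden layers
    tf.keras.layers.Dense(1, activation='sigmoid'),  # output activation
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),  # learning rate
    loss='binary_crossentropy',                                # loss function
    metrics=['accuracy'],
)

# history = model.fit(X_train, y_train, epochs=50, validation_split=0.2)  # number of epochs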

Tuning hyperparameters

Let’s discuss each hyperparameter individually before jumping into practice.

Picking the number of epochs

The number of epochs is the easiest hyperparameter to tune. We already know that the longer we train a network, the more accurate it becomes on the training data. However, training for too long becomes counterproductive: performance on unseen data stops improving, and accuracy can even start to decrease.
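
One common way to handle this in Keras is to set a generous upper bound on epochs and let early stopping end training once the validation loss stops improving. The sketch below assumes a compiled model and training data; model, X_train, and y_train are placeholders.

# Sketch: capping the number of epochs, with early stopping as a safety net.
# `model`, `X_train`, and `y_train` are assumed to exist already.
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',         # watch the validation loss, not the training loss
    patience=5,                 # stop after 5 epochs without improvement
    restore_best_weights=True,  # roll back to the best epoch seen so far
)

history = model.fit(
    X_train, y_train,
    epochs=100,                 # upper bound; early stopping usually ends training sooner
    validation_split=0.2,
    callbacks=[early_stop],
)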

Tuning the number of hidden layers

We don’t need any hidden layers if our data is linearly separable. Otherwise, we need to judge how complex the data is and choose the number of hidden layers accordingly. Adding more layers can improve performance, but the extra complexity can also lead to overfitting, so it’s best to start with one or two hidden layers and add more only if the model underfits.
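
For instance, here is a sketch of two Keras models that differ only in the number of hidden layers. The widths (8 units) and input shape are arbitrary placeholders.

# Sketch: the same binary classifier with one vs. two hidden layers.
# Widths and the input shape are placeholders, not recommendations.
import tensorflow as tf

shallow = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(8, activation='relu'),     # one hidden layer
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

deeper = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(8, activation='relu'),     # first hidden layer
    tf.keras.layers.Dense(8, activation='relu'),     # second hidden layer
    tf.keras.layers.Dense(1, activation='sigmoid'),
])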

Tuning the learning rate

To understand the trade-off of different learning rates, let’s go back to the basics and visualize gradient descent. The following diagrams show a few steps of gradient descent along a one-dimensional loss curve, with three different values of lr. The red cross marks the starting point, and the green cross marks the minimum:

Learning rate comparison

When we set a large value for lr, gradient descent tries to minimize the loss with substantial steps. It makes rapid progress, but it can overshoot the minimum and fail to converge. Large steps are sometimes used for large, sparse datasets because even if the algorithm does not converge, it can still uncover useful patterns. The opposite case is a small value for lr, where each step is tiny: the descent follows the loss curve closely, but it needs many more steps to reach the minimum.

Using a smaller value for lr is often preferred: each step is more precise, so the descent is more likely to settle near the minimum. If we have a smaller dataset, where extra steps are cheap, a small learning rate will usually yield better results.
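
To see the same trade-off numerically, here is a small, self-contained sketch of gradient descent on a one-dimensional quadratic loss with three illustrative learning rates. The loss, starting point, and lr values are made up for this example.

# Sketch: gradient descent on loss(w) = (w - 3)**2 with different learning rates.
# The starting point and lr values are illustrative only.
def gradient_descent(lr, w=0.0, steps=20):
    for _ in range(steps):
        grad = 2 * (w - 3)   # derivative of (w - 3)**2
        w = w - lr * grad
    return w

for lr in (0.01, 0.1, 1.1):
    w = gradient_descent(lr)
    print(f"lr={lr}: ends at w={w:.3f}, loss={(w - 3) ** 2:.3f}")

# A small lr (0.01) inches toward the minimum at w=3, a moderate lr (0.1)
# converges quickly, and a too-large lr (1.1) overshoots and diverges.

In Keras, the learning rate is set on the optimizer, for example tf.keras.optimizers.Adam(learning_rate=0.001).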

Choosing the right loss function

The goal of a loss function is to evaluate the “goodness” of the model’s predictions. There is no one-size-fits-all loss function; it is usually picked based on the machine learning problem we’re trying to solve, the features we’re using, and so on.

There are two broad categories depending on the learning task we’re dealing with — regression losses and classification losses. Mean squared error is one good loss function in regression cases, whereas categorical cross entropy loss is quite handy in classification.
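
For example, in Keras the loss function is chosen when the model is compiled. Here is a sketch of a regression model using mean squared error and a multi-class classifier using categorical cross entropy; the architectures and shapes are placeholders.

# Sketch: choosing a loss by task. Architectures and shapes are placeholders.
import tensorflow as tf

# Regression: one linear output, mean squared error.
regressor = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dense(1),
])
regressor.compile(optimizer='adam', loss='mse')

# Classification: one output per class, categorical cross entropy with softmax.
classifier = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dense(3, activation='softmax'),
])
classifier.compile(optimizer='adam', loss='categorical_crossentropy')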

Activation function

A neuron’s activation function transforms the neuron’s weighted input into its output and, in doing so, determines whether the neuron “fires” or is effectively ignored. In other words, the activation functions decide how the neurons combine their inputs to form the final output.

Sigmoid

We typically use the sigmoid activation function for the output layer of a binary classifier. It squashes any input into a value between 0 and 1, which we can read as the probability of the positive class: if the output is greater than 0.5, we predict class 1; otherwise, we predict class 0.
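
As a quick sketch, here is the sigmoid computed directly in NumPy, along with the usual 0.5 threshold for a binary decision:

# Sketch: sigmoid squashes any real input into (0, 1); 0.5 is the usual threshold.
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

for x in (-4.0, 0.0, 4.0):
    p = sigmoid(x)
    print(f"x={x:+.1f} -> sigmoid={p:.3f} -> class={int(p > 0.5)}")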

Tanh

The hyperbolic tangent (tanh) activation function is similar to the sigmoid function. It takes any real value as input and outputs values in the range of -1 to 1. Just like the sigmoid activation function, tanh has an S-shaped curve, ranging between “off” (-1) and “on” (1).
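
A short sketch confirming the output range of tanh:

# Sketch: tanh maps any real input into (-1, 1), with an S-shaped curve like sigmoid.
import numpy as np

for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={x:+.1f} -> tanh={np.tanh(x):+.3f}")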

ReLU

ReLU is one of the most straightforward and efficient activation functions in deep learning. Only a subset of neurons is active at any given time, which keeps the network sparse and computationally efficient.

However, ReLU is not differentiable at 0, and ReLU neurons can “die”: they become inactive for all inputs and stop updating. This happens most often when the learning rate is high, and it reduces the model’s capacity to learn.
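
A minimal sketch of ReLU itself, showing how negative inputs are zeroed out and activations become sparse:

# Sketch: ReLU is max(0, x); negative inputs are zeroed out, making activations sparse.
import numpy as np

def relu(x):
    return np.maximum(0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))  # [0.  0.  0.  0.5 2. ] -> only some neurons "fire"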

Softmax

We can use softmax for multi-class classification: it returns a probability for each class, and the predicted class is the one with the highest probability.

It’s often used in the last layer of neural networks.
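
As a sketch, here is softmax computed directly in NumPy on a made-up vector of scores, showing that the outputs form a probability distribution:

# Sketch: softmax turns raw scores (logits) into class probabilities that sum to 1.
import numpy as np

def softmax(logits):
    exp = np.exp(logits - np.max(logits))  # subtract the max for numerical stability
    return exp / exp.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs)            # approximately [0.659 0.242 0.099]
print(probs.argmax())   # 0 -> the class with the highest probability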

Let’s run the application below and tune these hyperparameters, without writing any code, on data that is not linearly separable.

# A utility function that plots the training loss and validation loss from
# a Keras history object.
import streamlit as st
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import seaborn as sns


def plot(history):
    plt.clf()
    plt.plot(history.history['loss'], label='Training set',
             color='blue', linestyle='-')
    plt.plot(history.history['val_loss'], label='Validation set',
             color='green', linestyle='--')
    plt.xlabel("Epochs")
    plt.ylabel("Loss")
    plt.xlim(0, len(history.history['loss']))
    plt.legend()
    plt.title("Training vs. Validation (loss)", fontsize=10)
    st.pyplot(plt.gcf())  # render the figure in the Streamlit app
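
As a usage sketch, assuming the app has already compiled a Keras model and prepared its data (model, X_train, and y_train are placeholders here), the helper above is called on the history object that model.fit returns:

# Hypothetical usage inside the app: train the model, then plot its loss curves.
history = model.fit(X_train, y_train, epochs=50, validation_split=0.2, verbose=0)
plot(history)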

Hyperparameters tuning application
