What is a gated recurrent unit (GRU)?

The gated recurrent unit (GRU) is a specialized variant of recurrent neural networks (RNNs) developed to tackle the limitations of conventional RNNs, such as the vanishing gradient problem. GRUs have been successful in various applications, including natural language processing, speech recognition, and time series prediction.

We will explore the inner workings of GRUs, delve into the mathematics behind their architecture, and see how they improve upon traditional RNNs.

Recurrent neural networks (RNNs)

Before delving into the GRU, let’s briefly look at recurrent neural networks (RNNs). These networks are designed to process sequential data by maintaining hidden states that act as context for processing subsequent inputs. However, traditional RNNs suffer from the vanishing gradient problem, hindering their ability to learn long-range dependencies effectively.


Understanding GRU

The GRU presents itself as an innovative solution to the vanishing gradient problem in traditional RNNs. It incorporates gating mechanisms that enable selective information update and resetting in the hidden state. This mechanism empowers the GRU to retain essential information and forget irrelevant data, facilitating the learning of long-term dependencies.

GRU architecture

The architecture of the gated recurrent unit (GRU) is built around two gates: the update gate and the reset gate. Each gate serves a distinct purpose and contributes significantly to the GRU's efficiency. The reset gate captures short-term dependencies, while the update gate captures long-term dependencies.

The various components of the architecture are:

  • Update gate (Z): Determines the degree of past information forwarded to the future.

  • Reset gate (R): Decides the amount of past information to discard.

  • Candidate hidden state (H'): Creates new representations, considering both the input and the past hidden state.

  • Final hidden state (H): A blend of the new and old memories governed by the update gate.

The GRU architecture can be illustrated as:

GRU architecture

Mathematical formulation

Let’s dive into the mathematical equations that define the behavior of the GRU:

Update gate

The computation of the update gate is the first step in a GRU. It uses the current input and the previous hidden state to decide how much of the previous hidden state should be carried forward, passing the result through the sigmoid function. The equation is as follows:

z_t = σ(W_z · [h_{t-1}, x_t] + b_z)

Where:

  • z_t represents the update gate vector.

  • σ denotes the sigmoid function.

  • W_z signifies the weight matrix for the update gate.

  • h_{t-1} is the previous hidden state.

  • x_t stands for the current input.

  • b_z is the bias for the update gate.

  • [h_{t-1}, x_t] represents the concatenation of h_{t-1} and x_t.
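To make this concrete, here is a minimal NumPy sketch of the update gate computation. The dimensions, the random weights, and the sigmoid helper are illustrative assumptions, not values taken from a trained model:

import numpy as np

def sigmoid(x):
    # Logistic function: squashes values into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

hidden_size, input_size = 4, 3           # assumed sizes for illustration
h_prev = np.random.randn(hidden_size)    # previous hidden state h_{t-1}
x_t = np.random.randn(input_size)        # current input x_t

W_z = np.random.randn(hidden_size, hidden_size + input_size)  # update gate weights
b_z = np.zeros(hidden_size)                                   # update gate bias

z_t = sigmoid(W_z @ np.concatenate([h_prev, x_t]) + b_z)      # update gate vector
print(z_t)  # every entry lies between 0 and 1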

Reset gate

The reset gate calculation, similar to the update gate, uses the sigmoid function. It decides how much of the past information to discard:

r_t = σ(W_r · [h_{t-1}, x_t] + b_r)

Where:

  • r_t represents the reset gate vector.

  • W_r signifies the weight matrix for the reset gate.

  • b_r is the bias for the reset gate.

Reset gate and update gate
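Continuing the illustrative NumPy sketch from the update gate above (it reuses the sigmoid helper, h_prev, x_t, hidden_size, and input_size defined there), the reset gate is computed in the same way with its own weights and bias:

W_r = np.random.randn(hidden_size, hidden_size + input_size)  # reset gate weights (random for illustration)
b_r = np.zeros(hidden_size)                                   # reset gate bias

r_t = sigmoid(W_r @ np.concatenate([h_prev, x_t]) + b_r)      # reset gate vector, values in (0, 1)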

Candidate hidden state

After the reset gate computation, a candidate hidden state is computed using the hyperbolic tangent function (tanh). The reset gate determines how much of the previous hidden state contributes to this candidate:

h'_t = tanh(W · [r_t ∗ h_{t-1}, x_t] + b)

Where:

  • h'_t is the candidate hidden state.

  • W is the weight matrix used in this computation.

  • b represents the bias.

  • ∗ indicates element-wise multiplication.

Adding the candidate hidden state to the illustration:

Candidate hidden state
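Extending the same illustrative sketch, the candidate hidden state applies the reset gate element-wise to the previous hidden state before the tanh nonlinearity:

W_h = np.random.randn(hidden_size, hidden_size + input_size)  # candidate-state weights (random for illustration)
b_h = np.zeros(hidden_size)                                   # candidate-state bias

h_candidate = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)  # candidate hidden state h'_t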

Final hidden state

The final hidden state (also known as the current hidden state) is then obtained by linear interpolation between the previous hidden state and the candidate hidden state, with the update gate controlling how much each contributes:

h_t = z_t ∗ h_{t-1} + (1 - z_t) ∗ h'_t

Where h_t is the current hidden state.

Adding the final hidden state to the illustration:

Final hidden state
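Putting the four equations together, here is a small, self-contained NumPy sketch of one GRU step applied over a short random sequence. The sizes and random parameters are assumptions for illustration, and it follows the convention used above, where the update gate weights the previous hidden state and (1 - z_t) weights the candidate (some references swap z_t and 1 - z_t, which is an equivalent relabeling):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, b_z, W_r, b_r, W_h, b_h):
    # One forward step of a GRU cell
    concat = np.concatenate([h_prev, x_t])                              # [h_{t-1}, x_t]
    z_t = sigmoid(W_z @ concat + b_z)                                   # update gate
    r_t = sigmoid(W_r @ concat + b_r)                                   # reset gate
    h_cand = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)   # candidate hidden state
    return z_t * h_prev + (1 - z_t) * h_cand                            # final hidden state h_t

# Illustrative sizes and randomly initialized parameters (not trained values)
hidden_size, input_size = 4, 3
W_z, b_z = np.random.randn(hidden_size, hidden_size + input_size), np.zeros(hidden_size)
W_r, b_r = np.random.randn(hidden_size, hidden_size + input_size), np.zeros(hidden_size)
W_h, b_h = np.random.randn(hidden_size, hidden_size + input_size), np.zeros(hidden_size)

h = np.zeros(hidden_size)                     # initial hidden state
for x_t in np.random.randn(5, input_size):    # a random sequence of 5 time steps
    h = gru_step(x_t, h, W_z, b_z, W_r, b_r, W_h, b_h)
print(h)                                      # hidden state after the whole sequence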

Code implementation

Here's a simple Python example that builds and trains a gated recurrent unit (GRU) model:

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dense

# Generate random data for demonstration
# Let's say we have 100 sequences, each of length 10, and each sequence item has 8 features
X = np.random.randn(100, 10, 8)
# Target could be anything; here's random regression targets for demonstration
y = np.random.randn(100)

# Build a simple GRU model
model = Sequential()
model.add(GRU(50, input_shape=(10, 8), return_sequences=True)) # 50 GRU units, return sequences for potential stacking
model.add(GRU(50)) # Another layer of GRU with 50 units
model.add(Dense(1)) # Regression output

model.compile(optimizer='adam', loss='mse')

# Train the model
model.fit(X, y, epochs=10)

print("Model has been trained!")

# Predict with the trained model
sample_input = np.random.randn(1, 10, 8)
predicted_output = model.predict(sample_input)
print(f"Predicted Output: {predicted_output}")

Code explanation

Let's break down the code:

Lines 1–4: Import necessary libraries

  • numpy is imported to handle numerical operations and data generation.

  • Various components are imported from tensorflow and keras for building and training the GRU model.

Lines 6–10: Generate random data for demonstration

  • This creates a dataset where each input instance is a sequence with 10 time steps, and each time step contains 8 features.

  • Random target values are generated for demonstration purposes.

Lines 13–16: Construct a simple neural network model with GRU layers

  • A sequential model is created using Keras, allowing us to build the model layer by layer.

  • A GRU layer with 50 units is added as the first layer. This layer will return sequences, enabling the potential stacking of other recurrent layers.

  • Another GRU layer with 50 units is added.

  • A dense layer with a single unit is added for regression output.

Line 18: Compile and set up the model for training

  • The model is compiled using the Adam optimization algorithm and the mean squared error loss function, indicating a regression task.

Line 21: Train the model on the generated data

  • The model is trained using the random data for 10 epochs.

Lines 26–28: Make a prediction using the trained model

  • A random input sequence is generated.

  • The trained model predicts the output for this sequence, and the prediction is printed.

Advantages of GRUs

The gated recurrent unit (GRU) offers several advantages over traditional RNNs:

  • Efficient training: GRUs facilitate more effective gradient flow during training, allowing the model to grasp long-range dependencies more effectively.

  • Simplicity: GRUs are simpler than more complex RNN architectures like LSTM, making them easier to implement and train.

  • Faster computation: The reduced number of parameters in GRUs makes them computationally more efficient than LSTMs; the short sketch after this list illustrates the difference in parameter counts.
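As a quick check of the parameter-count claim, the following sketch builds a GRU layer and an LSTM layer of the same size in Keras and prints the number of trainable parameters in each. The sizes are arbitrary choices for illustration:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, GRU, LSTM

timesteps, features, units = 10, 8, 50  # arbitrary sizes for illustration

gru_model = Sequential([Input(shape=(timesteps, features)), GRU(units)])
lstm_model = Sequential([Input(shape=(timesteps, features)), LSTM(units)])

print("GRU parameters: ", gru_model.count_params())
print("LSTM parameters:", lstm_model.count_params())  # the GRU layer has fewer parameters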

Applications of GRUs

GRUs find diverse applications in various domains, including:

  • Natural language processing: GRUs excel in tasks such as language modeling, machine translation, sentiment analysis, and text generation.

  • Speech recognition: GRUs play a vital role in automatic speech recognition systems for sequence-to-sequence modeling.

  • Time series prediction: GRUs are effective in forecasting tasks like stock price prediction, weather forecasting, and demand prediction.

Conclusion

The gated recurrent unit (GRU) stands as a powerful solution to the challenges posed by sequential data processing. By addressing the limitations of traditional RNNs through its innovative gating mechanisms, GRU has become a fundamental tool in various machine learning tasks. Its impact on natural language processing, speech recognition, and time series prediction is profound, and it continues to inspire further advancements in deep learning for sequential data.

Test your knowledge

Match the answer

Match each term on the left with its description on the right.

Terms:

  • Update gate

  • Reset gate

  • Candidate hidden state

  • Final hidden state

  • Vanishing gradient problem

Descriptions:

  • Problem in traditional RNNs

  • Governs past information to the future

  • Decides past information to discard

  • Blend of new and old memory

  • Considers input and past hidden state
