...
The Deep Learning Approach to Natural Language Processing
Learn about the deep learning approach to natural language processing.
Deep learning revolution
It’s safe to say that deep learning revolutionized machine learning, especially in fields such as computer vision, speech recognition, and, of course, NLP. Deep models created a wave of paradigm shifts across machine learning because they learn rich features from raw data instead of relying on limited, human-engineered features. Consequently, the tedious and expensive process of feature engineering became largely obsolete, and the traditional workflow became more efficient, since deep models perform feature learning and task learning simultaneously. Moreover, due to the massive number of parameters (that is, weights) in a deep model, it can encompass significantly more features than a human could have engineered.
However, deep models are considered opaque due to their poor interpretability. For example, understanding how and what features a deep model has learned for a given problem is still an active area of research. That said, a growing body of research is focused on the interpretability of deep learning models.
What are deep models?
A deep neural network is essentially an artificial neural network that has an input layer, many interconnected hidden layers in the middle, and finally, an output layer (for example, a classifier or a regressor). As we can see, this forms an end-to-end model from raw data to predictions. The hidden layers in the middle give deep models their power because they are responsible for learning good features from raw data, which ultimately allows the model to succeed at the task at hand. A minimal sketch of such a model follows; after that, let’s look briefly at the history of deep learning.
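The sketch below assumes TensorFlow/Keras is installed; the layer sizes and the ten-class softmax output are illustrative choices, not prescribed by the text.

```python
import tensorflow as tf

# End-to-end deep model: raw input in, class predictions out.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),                     # raw input (e.g., a flattened image)
    tf.keras.layers.Dense(256, activation="relu"),    # hidden layer 1: learns low-level features
    tf.keras.layers.Dense(128, activation="relu"),    # hidden layer 2
    tf.keras.layers.Dense(64, activation="relu"),     # hidden layer 3: higher-level features
    tf.keras.layers.Dense(10, activation="softmax"),  # output layer: a 10-class classifier
])

# Feature learning and task learning happen together when the model is
# trained end to end on (raw input, label) pairs.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

The hidden `Dense` layers play the role of the feature learners described above, while the final layer is the task-specific classifier.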
History of deep learning
Let’s briefly discuss the roots of deep learning and how the field evolved to be a very promising technique for machine learning. In 1960, Hubel and Weisel performed an interesting experiment and discovered that a cat’s visual cortex is made of simple and complex cells and that these cells are organized in a hierarchical form. Also, these cells react differently to different stimuli. For example, simple cells are activated by variously oriented edges, while complex cells are insensitive to spatial variations (for example, the orientation of the edge). This kindled the motivation for replicating a similar behavior in machines, giving rise to the concept of artificial neural networks.
Neural networks
In the years that followed, neural networks gained the attention of many researchers. In 1965, a neural network trained by a method known as the group method of data handling (GMDH) and based on the famous perceptron by Rosenblatt was introduced by Ivakhnenko and others. Later, in 1979, Fukushima introduced the neocognitron, which planted the seeds for one of the most famous variants of deep models—convolutional neural networks (CNNs). Unlike the perceptrons, which always took in a 1D input, a neocognitron was able to process 2D inputs using convolution operations.
Artificial neural networks are trained by backpropagating the error signal to optimize the network parameters: the gradients of the loss are computed with respect to the weights of a given layer, and the weights are then pushed in the opposite direction of the gradient in order to minimize the loss. For a layer further away from the output layer (i.e., where the loss is computed), the algorithm uses the chain rule to compute the gradients. Applying the chain rule across many layers led to a practical problem known as the vanishing gradients problem, which strictly limits the potential number of layers (depth) of a neural network: the gradients of layers closer to the inputs (i.e., further away from the output layer) become very small, causing model training to stop prematurely and leaving the model underfitted.
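To see why this happens, here is a toy numerical sketch (random weights and made-up layer sizes, not code from the text) that backpropagates a gradient through a stack of sigmoid layers; because each layer multiplies in a sigmoid derivative of at most 0.25, the gradient norm reaching the earliest layers typically collapses toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical network: a stack of small, fully connected sigmoid layers.
n_layers, width = 30, 16
x = rng.normal(size=width)
weights = [rng.normal(scale=0.5, size=(width, width)) for _ in range(n_layers)]

# Forward pass, keeping the per-layer sigmoid derivatives the chain rule needs.
activations, derivs = [x], []
for W in weights:
    z = W @ activations[-1]
    a = sigmoid(z)
    derivs.append(a * (1 - a))          # sigmoid'(z), always <= 0.25
    activations.append(a)

# Backward pass: the chain rule multiplies one Jacobian per layer into the gradient.
grad = np.ones(width)                   # pretend dLoss/dOutput is all ones
for W, d in zip(reversed(weights), reversed(derivs)):
    grad = W.T @ (grad * d)             # dLoss/d(previous layer's output)

print("gradient norm reaching the first layer:", np.linalg.norm(grad))
```

Running this typically prints a norm many orders of magnitude below one, which is exactly the starvation of learning signal in the early layers described above.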
Then, in 2006, it was found that pretraining a deep neural network by minimizing the reconstruction error (obtained by trying to compress the input to a lower dimensionality and then reconstructing it back into the original dimensionality) for each layer of the network provides a good initial starting point for the weights of the neural network; this allows a consistent flow of gradients from the output layer to the input layer. This essentially allowed neural network models to have more layers without the ill effects of the vanishing gradient. Also, these deeper models were able to surpass traditional machine learning models in many tasks, mostly in computer vision (for example, test accuracy on the MNIST handwritten digit dataset). With this breakthrough, deep learning became the buzzword in the machine learning community.
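As a rough illustration of the reconstruction idea (a sketch assuming TensorFlow/Keras; the 784-to-64 dimensions are arbitrary, and this shows a single autoencoder rather than the full layer-wise pretraining procedure), an encoder compresses the input, a decoder reconstructs it, and training minimizes the reconstruction error:

```python
import tensorflow as tf

# Encoder compresses the raw input into a lower-dimensional code;
# the decoder tries to reconstruct the original input from that code.
inputs = tf.keras.Input(shape=(784,))
code = tf.keras.layers.Dense(64, activation="relu")(inputs)             # compression
reconstruction = tf.keras.layers.Dense(784, activation="sigmoid")(code)

autoencoder = tf.keras.Model(inputs, reconstruction)

# Training minimizes the reconstruction error; the learned weights can then
# serve as an initialization for a deeper supervised network.
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(x_train, x_train, epochs=5)  # note: the targets are the inputs themselves
```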
Progress in speech recognition
Things gained momentum when, in 2012, AlexNet (a deep convolutional neural network created by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton) won the Large Scale Visual Recognition Challenge (LSVRC) 2012 with an error decrease of about 10% from the previous best. During this time, advances were also made in speech recognition, where state-of-the-art accuracies were reported using deep neural networks. Furthermore, people began to realize that graphics processing units (GPUs) enable more parallelism than central processing units (CPUs), which allows for faster training of larger and deeper networks.