Deep Learning Basics

Get familiar with the foundational concepts of deep learning, artificial neural networks, and parameter estimation.

Deep learning is a subset of machine learning, a field of artificial intelligence that uses mathematics and computers to learn, from data, a map from some input to some output. Loosely speaking, a map, or model, is a function with parameters that maps the input to an output. Learning the map, also known as the model, occurs by updating the parameters of the map such that some expected empirical loss is minimized. The empirical loss is a measure of the distance between the values predicted by the model and the target values, given the empirical data.

Notice that this learning setup is extremely powerful because it does not require an explicit understanding of the rules that define the map. An interesting aspect of this setup is that it does not guarantee that we will learn the exact map from the input to the output, but rather some other map that, in expectation, predicts the correct output.

This learning setup, however, does not come without a price: some deep learning methods require large amounts of data, especially when compared with methods that rely on feature engineering. Fortunately, free data, especially unlabeled data, is widely available in many domains.

The term deep learning itself refers to the use of multiple layers in an artificial neural network (ANN) to form a deep chain of functions. The term ANN suggests that such models informally draw inspiration from theoretical models of how learning could happen in the brain. ANNs, also referred to as deep neural networks, are the main class of models considered in this course.

Artificial neural networks (ANNs)

Despite its recent success in many applications, deep learning is not new, and according to Ian Goodfellow, Yoshua Bengio, and Aaron Courville, there have been three eras:

  • Cybernetics between the 1940s and the 1960s

  • Connectionism between the 1980s and the 1990s

  • The current deep learning renaissance, beginning in 2006

Mathematically speaking, a neural network is a graph of non-linear functions whose parameters can be estimated using methods such as stochastic gradient descent with backpropagation. We will introduce ANNs step by step, starting with linear and logistic regression.

Linear regression

Linear regression is used to estimate the parameters of a model that describes the relationship between an output variable and the given input variables. Mathematically, the model is a weighted sum of the input variables:

    z = f(x) = w^T x + b

Here, the weight, w, and the input, x, are vectors in R^d; in other words, they are real-valued vectors with d dimensions. The term b is a scalar bias, and z is a scalar that represents the value of the function f at the input x. In ANNs, the output of a single neuron without non-linearities is similar to the output of the linear model described in the preceding linear regression equation and the following diagram:

Diagram: Linear regression
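
As a concrete illustration, the following sketch computes the output of such a linear model with NumPy. The specific weights, bias, and input values are made up for the example.

    import numpy as np

    # Made-up parameters of a linear model with d = 3 input variables.
    w = np.array([0.5, -1.2, 2.0])   # weight vector in R^d
    b = 0.1                          # scalar bias term
    x = np.array([1.0, 0.0, 3.0])    # input vector in R^d

    # Weighted sum of the inputs: z = w^T x + b
    z = np.dot(w, x) + b
    print(z)  # 6.6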

Logistic regression

Logistic regression is a special version of regression where a specific non-linear function, the sigmoid function, is applied to the output, z, of the linear model in the earlier linear regression equation:

    σ(z) = 1 / (1 + e^(-z)), where z = w^T x + b

In ANNs, the non-linear model described in the logistic regression equation is similar to the output of a single neuron with a sigmoid non-linearity in the following diagram:

Diagram: Logistic regression
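
Continuing the sketch above, applying the sigmoid squashes the output of the linear model into the range (0, 1); the values are again made up for illustration.

    import numpy as np

    def sigmoid(z):
        # Sigmoid non-linearity: maps any real number into the range (0, 1).
        return 1.0 / (1.0 + np.exp(-z))

    # Made-up parameters and input, as in the linear regression sketch.
    w = np.array([0.5, -1.2, 2.0])
    b = 0.1
    x = np.array([1.0, 0.0, 3.0])

    z = np.dot(w, x) + b   # output of the linear model
    y = sigmoid(z)         # output of a single neuron with a sigmoid non-linearity
    print(y)               # approximately 0.9986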

A combination of such neurons defines a hidden layer in a neural network, and neural networks are organized as a chain of layers. The output of a hidden layer is described by the following equation and diagram:

    h^(l) = g(W^(l) x + b^(l))

Here, W^(l) is the weight matrix of layer l, whose rows are the weight vectors of the individual neurons; the input, x, is a vector in R^d; b^(l) is a vector of bias terms; h^(l) is the vector of the layer's outputs; and g is a non-linearity applied element-wise:

Diagram: Fully connected neural network

The preceding diagram depicts a fully connected neural network with two inputs, two hidden layers with three nodes each, and one output node.
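
The following sketch implements a forward pass through exactly this architecture (two inputs, two hidden layers of three neurons each, and a single output) using NumPy; the weights are initialized randomly for illustration rather than learned.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def layer(x, W, b, g):
        # Output of one fully connected layer: h = g(W x + b)
        return g(W @ x + b)

    # Randomly initialized (not learned) parameters for the network in the diagram:
    # 2 inputs -> 3 hidden units -> 3 hidden units -> 1 output.
    W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
    W2, b2 = rng.normal(size=(3, 3)), np.zeros(3)
    W3, b3 = rng.normal(size=(1, 3)), np.zeros(1)

    x = np.array([0.5, -1.0])           # two input values
    h1 = layer(x, W1, b1, sigmoid)      # first hidden layer, h^(1)
    h2 = layer(h1, W2, b2, sigmoid)     # second hidden layer, h^(2)
    y = W3 @ h2 + b3                    # single output node
    print(y)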

In general, neural networks have a chain-like structure that is easy to visualize in equation form or as a graph, as the previous diagram shows. For example, consider the functions f and g used in the model y = g(f(x)). In this simple model of a neural network, the input x is used to produce the output f(x); f(x) is then used as the input to g, which finally produces y.

In this simple model, the function f is considered to be the first hidden layer, and the function g is considered to be the second hidden layer. These layers are called hidden because, unlike the input and output values of the model, which are known a priori, their values are not directly observed.

In each layer, the network learns features, or projections of the data, that are useful for the task at hand. For example, in computer vision, there is evidence that the layers closer to the input learn filters associated with basic shapes, such as edges and corners, whereas the layers closer to the output learn filters that respond to more complex, image-like patterns, such as object parts.

The following figure, taken from the paper “Visualizing and Understanding Convolutional Networks” by Zeiler and Fergus, provides a visualization of the filters in the first convolutional layer of a trained AlexNet (for a thorough introduction to the topic of neural network visualization, refer to Stanford's class on Convolutional Neural Networks for Visual Recognition):

Figure: Visualization of features in a neural network
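
If you want to produce this kind of figure yourself, the following sketch, assuming PyTorch, torchvision, and Matplotlib are installed, plots the first-layer convolutional filters of torchvision's pretrained AlexNet; it is only one of many possible ways to inspect learned features.

    import matplotlib.pyplot as plt
    from torchvision import models

    # Load an AlexNet pretrained on ImageNet (weights are downloaded on first use).
    model = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)

    # The first module of model.features is the first convolutional layer;
    # its weights have shape (64, 3, 11, 11): 64 filters of size 3x11x11.
    filters = model.features[0].weight.detach().clone()

    # Rescale each filter to [0, 1] so it can be shown as an RGB image.
    filters -= filters.amin(dim=(1, 2, 3), keepdim=True)
    filters /= filters.amax(dim=(1, 2, 3), keepdim=True)

    fig, axes = plt.subplots(8, 8, figsize=(8, 8))
    for ax, f in zip(axes.flat, filters):
        ax.imshow(f.permute(1, 2, 0))  # channels-last layout for imshow
        ax.axis("off")
    plt.show()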

Parameter estimation

The output of each layer of the network depends on the parameters of the model, which are estimated by training the neural network to minimize the loss with respect to the weights, L(w), as we described earlier. This is a general principle in machine learning, in which a learning procedure, for example, gradient descent with backpropagation, uses the gradients of the error of a model to update its parameters and minimize the error. Consider estimating the parameters of a linear regression model such that the output of the model minimizes the mean squared error (MSE). Mathematically speaking, the point-wise error between the w^T x predictions and the y target values is computed as follows:

    e_i = y_i - w^T x_i

The MSE is computed as follows:

    MSE = (1/n) * sum_i e_i^2 = (1/n) e^T e

Here, e^T e represents the sum of the squared errors (SSE), and 1/n normalizes the SSE by the number of samples to yield the MSE.
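
As a quick illustration, the following sketch computes the point-wise errors and the MSE for a small, made-up dataset and a made-up weight vector.

    import numpy as np

    # Made-up data: n = 4 observations with d = 2 features each, plus targets.
    X = np.array([[1.0, 2.0],
                  [0.0, 1.0],
                  [3.0, 1.0],
                  [2.0, 2.0]])
    y = np.array([5.1, 1.8, 5.2, 6.0])

    w = np.array([1.0, 2.0])   # made-up weights (no bias term for simplicity)

    e = y - X @ w              # point-wise errors e_i = y_i - w^T x_i
    sse = e @ e                # sum of squared errors, e^T e
    mse = sse / len(y)         # MSE = (1/n) e^T e
    print(e, mse)              # e is approx. [0.1, -0.2, 0.2, 0.0], mse approx. 0.0225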

In the case of linear regression, the problem is convex, and minimizing the MSE has the simple closed-form solution given in the following equation:

    w = (X^T X)^{-1} X^T y

Here, w is the coefficient vector of the linear model, X is the matrix whose rows are the observations with their respective features, and y is the vector of response values associated with each observation. Note that this closed-form solution requires the X^T X matrix to be invertible and, therefore, to have a non-zero determinant.
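
Continuing the example above, the closed-form solution can be computed directly with NumPy; np.linalg.solve is used instead of explicitly inverting X^T X because it is numerically more stable.

    import numpy as np

    # Same made-up data as before.
    X = np.array([[1.0, 2.0],
                  [0.0, 1.0],
                  [3.0, 1.0],
                  [2.0, 2.0]])
    y = np.array([5.1, 1.8, 5.2, 6.0])

    # Normal equations: w = (X^T X)^{-1} X^T y, solved without an explicit inverse.
    w = np.linalg.solve(X.T @ X, X.T @ y)
    print(w)

    # The fitted w minimizes the MSE on this data.
    print(np.mean((y - X @ w) ** 2))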

In the case of models where a closed-form solution to the loss does not exist, we can estimate the parameters that minimize the MSE by computing the partial derivative of the MSE loss, L, with respect to each weight and using the negative of that value, scaled by a learning rate, α, to update the w parameters of the model being evaluated:

    w ← w - α ∂L/∂w
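
The sketch below applies this update rule to the same made-up linear regression data; the learning rate and the number of iterations are arbitrary choices for illustration.

    import numpy as np

    X = np.array([[1.0, 2.0],
                  [0.0, 1.0],
                  [3.0, 1.0],
                  [2.0, 2.0]])
    y = np.array([5.1, 1.8, 5.2, 6.0])

    w = np.zeros(2)   # start from all-zero weights
    alpha = 0.05      # learning rate (arbitrary)

    for _ in range(2000):
        e = y - X @ w                    # point-wise errors
        grad = -2.0 / len(y) * X.T @ e   # dL/dw for the MSE loss
        w = w - alpha * grad             # gradient descent update
    print(w)  # converges toward the closed-form solution above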

A model in which many of the coefficients in w are zero is said to be sparse. Given the large number of parameters, or coefficients, in deep learning models, sparsity is valuable because it can reduce computation requirements and lead to faster inference.