We have seen that an RBM with a single hidden layer can be used to learn a generative model of images; in fact, theoretical work has suggested that with a sufficiently large number of hidden units, an RBM can approximate any distribution over binary values (Pearl, J., and Russell, S. (2000). Bayesian Networks. https://ftp.cs.ucla.edu/pub/stat_ser/r277.pdf). However, in practice, for very large input data, it may be more efficient to add additional layers, instead of a single large layer, allowing a more “compact” representation of the data.
Researchers who developed DBNs also showed that adding layers in this way can only improve (never decrease) a variational lower bound on the log-likelihood of the data under the generative model (Hinton, G.E., Osindero, S., and Teh, Y.W. (2006). A fast learning algorithm for deep belief nets. Neural Computation 18(7):1527-54. https://www.cs.toronto.edu/~hinton/absps/fastnc.pdf). In this case, the hidden layer output h of the first RBM becomes the input to a second RBM, and we can keep adding layers to build a deeper network. Furthermore, if we wanted this network to learn not only the distribution of the image (x) but also its label (y), the digit from 0 to 9 that it represents, we could add one more layer to the stack of connected RBMs. This layer would use a softmax function to represent the probability distribution over the 10 possible digit classes.
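To make the stacking concrete, here is a minimal NumPy sketch of the forward flow through two stacked RBMs topped by a softmax label layer. The layer sizes (784 visible units for a 28x28 binary image, 256 and 64 hidden units, 10 classes), the RBM class, and its hidden_probs method are illustrative assumptions for this sketch, not code from the text; the training step is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class RBM:
    """Minimal RBM holding weights and a hidden-unit activation pass (illustrative)."""
    def __init__(self, n_visible, n_hidden, rng):
        self.W = rng.normal(0, 0.01, size=(n_visible, n_hidden))
        self.b_hidden = np.zeros(n_hidden)

    def hidden_probs(self, v):
        # p(h = 1 | v): the output that is fed to the next RBM in the stack
        return sigmoid(v @ self.W + self.b_hidden)

rng = np.random.default_rng(0)

# Stack two RBMs: the hidden output of the first becomes the input of the second.
rbm1 = RBM(n_visible=784, n_hidden=256, rng=rng)   # 28x28 binary image -> h1
rbm2 = RBM(n_visible=256, n_hidden=64, rng=rng)    # h1 -> h2

# Top layer: softmax over the 10 digit classes (0-9).
W_label = rng.normal(0, 0.01, size=(64, 10))
b_label = np.zeros(10)

x = rng.integers(0, 2, size=(1, 784)).astype(float)   # a fake binary "image"
h1 = rbm1.hidden_probs(x)                  # first hidden representation
h2 = rbm2.hidden_probs(h1)                 # deeper, more compact representation
y_probs = softmax(h2 @ W_label + b_label)  # distribution over digits 0-9
print(y_probs.shape)                       # (1, 10)
```

In a full DBN, each RBM would first be trained greedily (for example, with contrastive divergence) on the output of the layer below before the next one is stacked on top; only the feature flow through the stack is shown here.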
Stacking RBMs
A problem with training a very deep graphical model such as a stack of RBMs is the “explaining-away” effect (Hinton, G.E., Osindero, S., and Teh, Y.W. A fast learning algorithm for deep belief nets, 1527-54). Recall that dependencies between variables can complicate the inference of the states of the hidden variables.
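To see explaining away concretely, here is a small, self-contained Python illustration using a toy network with two independent binary causes and one observed effect; the structure and all probability values (0.1 priors, 0.95/0.01 conditionals) are invented for this sketch and are not taken from the text. Observing the effect makes each cause more probable, but additionally observing that one cause is active pushes the posterior of the other back toward its prior: the active cause “explains away” the evidence.

```python
import itertools

# Prior probabilities of two independent hidden causes A and B.
p_a, p_b = 0.1, 0.1

def p_e_given(a, b):
    # The effect E is very likely if either cause is active, rare otherwise.
    return 0.95 if (a or b) else 0.01

def joint(a, b, e):
    pa = p_a if a else 1 - p_a
    pb = p_b if b else 1 - p_b
    pe = p_e_given(a, b) if e else 1 - p_e_given(a, b)
    return pa * pb * pe

def posterior_a(evidence):
    """P(A=1 | evidence) by brute-force enumeration over the joint distribution."""
    num = den = 0.0
    for a, b, e in itertools.product([0, 1], repeat=3):
        assignment = {"A": a, "B": b, "E": e}
        if any(assignment[k] != v for k, v in evidence.items()):
            continue
        p = joint(a, b, e)
        den += p
        if a == 1:
            num += p
    return num / den

print(posterior_a({"E": 1}))          # ~0.5: observing the effect makes A likely
print(posterior_a({"E": 1, "B": 1}))  # ~0.1: knowing B is active "explains away" A
```

The two causes are independent a priori, yet become dependent once the shared effect is observed; it is exactly this kind of induced dependency among hidden variables that makes exact inference hard in deep directed models.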