## Wednesday, March 12, 2014

### RBM and DBN

Some nice explanations of RBMs and DBNs from the paper *Application of Deep Belief Network for Natural Language Understanding*:

RBMs can be trained using unlabeled data and they can learn stochastic binary features which are good for modeling the higher-order statistical structure of a dataset. Even though these features are discovered without considering the discriminative task for which they will be used, some of them are typically very useful for classification as well as for generation.

After training the network consisting of the visible layer and the first hidden layer, which we will refer to as $\text{RBM}_1$, its learned parameters, $\theta_1$, define $p(\boldsymbol{v}, \boldsymbol{h}|\theta_1), p(\boldsymbol{v}|\theta_1), p(\boldsymbol{v}|\boldsymbol{h}, \theta_1)$ and $p(\boldsymbol{h}|\boldsymbol{v}, \theta_1)$ via equations
$p(h_j = 1 | \boldsymbol{v}) = \sigma(a_j + \sum_i v_i w_{ij})$
and
$p(v_i =1 | \boldsymbol{h}) = \sigma(b_i + \sum_j h_j w_{ij})$.
The parameters of $\text{RBM}_1$ also define a prior distribution over hidden vectors, $p(\boldsymbol{h}|\theta_1)$, which is obtained by marginalizing over the space of visible vectors. This allows $p(\boldsymbol{v}|\theta_1)$ to be written as:
$p(\boldsymbol{v}|\theta_1) = \sum_{\boldsymbol{h}} p(\boldsymbol{h}|\theta_1) p(\boldsymbol{v}|\boldsymbol{h}, \theta_1)$.
The idea behind training a DBN by training a stack of RBMs is to keep the $p(\boldsymbol{v}|\boldsymbol{h}, \theta_1)$ defined by $\text{RBM}_1$, but to improve $p(\boldsymbol{v})$ by replacing $p(\boldsymbol{h}|\theta_1)$ by a better prior over the hidden vectors. To improve $p(\boldsymbol{v})$, this better prior must have a smaller KL divergence than $p(\boldsymbol{h}|\theta_1)$ from the "aggregated posterior", which is the equally weighted mixture of the posterior distributions over the hidden vectors of $\text{RBM}_1$ on all $N$ of the training cases:
$\frac{1}{N} \sum_{\boldsymbol{v}\in \textbf{train}} p(\boldsymbol{h}|\boldsymbol{v}, \theta_1)$. The analogous statement for Gaussian mixture models is that the updated mixing proportion of a component should be closer to the average posterior probability of that component over all training cases.
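To make the quantities above concrete, here is a minimal sketch that enumerates every state of a tiny RBM and checks the mixture identity for $p(\boldsymbol{v}|\theta_1)$ numerically. The weights, biases, and the four training vectors are made up for illustration, and the energy $E(\boldsymbol{v}, \boldsymbol{h}) = -\sum_i b_i v_i - \sum_j a_j h_j - \sum_{i,j} v_i w_{ij} h_j$ is the standard binary RBM energy, assumed here because the quoted text does not spell it out. Exact enumeration only works at toy sizes.

```python
import itertools
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical tiny RBM_1: 3 visible units, 2 hidden units (illustrative values).
W = [[0.5, -0.3], [0.2, 0.8], [-0.7, 0.1]]  # W[i][j] couples v_i and h_j
a = [0.1, -0.2]       # hidden biases a_j
b = [0.0, 0.3, -0.1]  # visible biases b_i

def p_h_given_v(v):
    """p(h_j = 1 | v) = sigmoid(a_j + sum_i v_i w_ij), one value per hidden unit."""
    return [sigmoid(a[j] + sum(v[i] * W[i][j] for i in range(len(b))))
            for j in range(len(a))]

def p_v_given_h(h):
    """p(v_i = 1 | h) = sigmoid(b_i + sum_j h_j w_ij), one value per visible unit."""
    return [sigmoid(b[i] + sum(h[j] * W[i][j] for j in range(len(a))))
            for i in range(len(b))]

def bernoulli(x, probs):
    """Probability of binary vector x when unit k is on with probability probs[k]."""
    return math.prod(p if xk else 1.0 - p for xk, p in zip(x, probs))

def boltzmann_weight(v, h):
    """exp(-E(v, h)) for the assumed energy E = -b.v - a.h - v^T W h."""
    neg_energy = (sum(b[i] * v[i] for i in range(len(b)))
                  + sum(a[j] * h[j] for j in range(len(a)))
                  + sum(v[i] * W[i][j] * h[j]
                        for i in range(len(b)) for j in range(len(a))))
    return math.exp(neg_energy)

V = list(itertools.product([0, 1], repeat=len(b)))  # all visible vectors
H = list(itertools.product([0, 1], repeat=len(a)))  # all hidden vectors
Z = sum(boltzmann_weight(v, h) for v in V for h in H)

def p_v(v):
    """Marginal p(v | theta_1), summing the joint over hidden vectors."""
    return sum(boltzmann_weight(v, h) for h in H) / Z

def p_h(h):
    """Prior p(h | theta_1), summing the joint over visible vectors."""
    return sum(boltzmann_weight(v, h) for v in V) / Z

# Largest violation of p(v) = sum_h p(h) p(v | h) over all visible vectors.
mixture_gap = max(
    abs(p_v(v) - sum(p_h(h) * bernoulli(v, p_v_given_h(h)) for h in H))
    for v in V
)

# Aggregated posterior: equally weighted mixture of p(h | v) over training cases.
train = [(1, 0, 1), (1, 1, 0), (0, 1, 1), (1, 1, 1)]  # made-up training set
aggregated = {h: sum(bernoulli(h, p_h_given_v(v)) for v in train) / len(train)
              for h in H}

# KL(aggregated posterior || prior): the quantity a better prior should reduce.
kl = sum(q * math.log(q / p_h(h)) for h, q in aggregated.items())
print(mixture_gap, kl)
```

The last few lines compute the KL divergence from the aggregated posterior to the current prior $p(\boldsymbol{h}|\theta_1)$, which is the number a replacement prior has to beat.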

Now consider training $\text{RBM}_2$, which is the network formed by using the samples from the aggregated posterior of $\text{RBM}_1$ as training data. It is easy to ensure that the distribution which $\text{RBM}_2$ defines over its visible units is identical to $p(\boldsymbol{h}|\theta_1)$: we simply initialize $\text{RBM}_2$ to be an upside-down version of $\text{RBM}_1$ in which the roles of visible and hidden units have been swapped. So $\text{RBM}_2$ has $\boldsymbol{h}$ as a visible vector and $\boldsymbol{h}_2$ as a hidden vector. Then we train $\text{RBM}_2$, which makes $p(\boldsymbol{h}|\theta_2)$ a better model of the aggregated posterior than $p(\boldsymbol{h}|\theta_1)$.
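The upside-down initialization can be checked by brute force on a toy model. The sketch below uses made-up parameters for $\text{RBM}_1$ and the standard binary RBM energy, which is an assumption since the quoted text does not state it. It builds $\text{RBM}_2$ by transposing the weight matrix and swapping the two bias vectors, then compares the distribution $\text{RBM}_2$ defines over its visible units with $p(\boldsymbol{h}|\theta_1)$. The helper name `joint` is hypothetical.

```python
import itertools
import math

def joint(W, vis_bias, hid_bias):
    """Exact joint distribution of a tiny binary RBM, by enumeration.

    W[i][j] couples visible unit i with hidden unit j. Returns a dict that
    maps (visible_vector, hidden_vector) to its probability.
    """
    n_vis, n_hid = len(vis_bias), len(hid_bias)
    weights = {}
    for x in itertools.product([0, 1], repeat=n_vis):
        for y in itertools.product([0, 1], repeat=n_hid):
            neg_energy = (
                sum(vis_bias[i] * x[i] for i in range(n_vis))
                + sum(hid_bias[j] * y[j] for j in range(n_hid))
                + sum(x[i] * W[i][j] * y[j]
                      for i in range(n_vis) for j in range(n_hid))
            )
            weights[(x, y)] = math.exp(neg_energy)
    Z = sum(weights.values())
    return {state: w / Z for state, w in weights.items()}

# Hypothetical trained RBM_1: 3 visible units (v), 2 hidden units (h).
W1 = [[0.5, -0.3], [0.2, 0.8], [-0.7, 0.1]]
a1 = [0.1, -0.2]       # hidden biases of RBM_1
b1 = [0.0, 0.3, -0.1]  # visible biases of RBM_1

# RBM_2 starts as RBM_1 turned upside down: h becomes the visible layer,
# and the new hidden layer h_2 has as many units as v.
W2 = [list(column) for column in zip(*W1)]  # transpose, shape 2 x 3
vis_bias2 = list(a1)  # visible biases of RBM_2 = hidden biases of RBM_1
hid_bias2 = list(b1)  # hidden biases of RBM_2 = visible biases of RBM_1

joint1 = joint(W1, b1, a1)
joint2 = joint(W2, vis_bias2, hid_bias2)

# p(h | theta_1): marginalize RBM_1 over its visible vectors.
prior_h_rbm1 = {}
for (v, h), p in joint1.items():
    prior_h_rbm1[h] = prior_h_rbm1.get(h, 0.0) + p

# Distribution RBM_2 defines over its visible units: marginalize over h_2.
visible_marginal_rbm2 = {}
for (h, h2), p in joint2.items():
    visible_marginal_rbm2[h] = visible_marginal_rbm2.get(h, 0.0) + p

max_gap = max(abs(prior_h_rbm1[h] - visible_marginal_rbm2[h])
              for h in prior_h_rbm1)
print(max_gap)  # difference is at the level of floating-point rounding
```

Starting from this point means training $\text{RBM}_2$ begins with exactly the prior that $\text{RBM}_1$ implied, so any improvement in modeling the aggregated posterior is an improvement over $p(\boldsymbol{h}|\theta_1)$.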

After training $\text{RBM}_2$, we can combine the two RBMs to create a hybrid of a directed and an undirected model. $p(\boldsymbol{h}|\theta_2)$ is defined by the undirected $\text{RBM}_2$, but $p(\boldsymbol{v}|\boldsymbol{h}, \theta_1)$ is defined by directed connections from the first hidden layer to the visible units. In this hybrid model, which we call a deep belief net, exact inference of $p(\boldsymbol{h}|\boldsymbol{v}, \theta_1, \theta_2)$ is no longer easy because the prior over the hidden vectors is no longer defined by $\theta_1$. However, if we perform approximate inference for the first hidden layer by using equation $p(h_j=1|\boldsymbol{v})=\sigma(a_j + \sum_i v_i w_{ij})$, there is a variational lower bound on the log probability of the training data that is improved every time we add another layer to the DBN, provided we add it in the appropriate way.
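For reference, the bound can be written out for a single training vector. This is a sketch of the standard variational argument, not a formula taken from the quoted text. The symbol $q(\boldsymbol{h}|\boldsymbol{v})$ is introduced here for the factorial approximation $p(\boldsymbol{h}|\boldsymbol{v}, \theta_1)$ given by the sigmoid equation:

$\log p(\boldsymbol{v}|\theta_1, \theta_2) \ge \sum_{\boldsymbol{h}} q(\boldsymbol{h}|\boldsymbol{v}) \left[ \log p(\boldsymbol{h}|\theta_2) + \log p(\boldsymbol{v}|\boldsymbol{h}, \theta_1) \right] - \sum_{\boldsymbol{h}} q(\boldsymbol{h}|\boldsymbol{v}) \log q(\boldsymbol{h}|\boldsymbol{v})$

With $\theta_1$ held fixed, the only term that changes while $\text{RBM}_2$ is trained is $\log p(\boldsymbol{h}|\theta_2)$. Raising the average log probability that $p(\boldsymbol{h}|\theta_2)$ assigns to samples from $q$ therefore raises the bound. When $\text{RBM}_2$ is initialized as the upside-down $\text{RBM}_1$, $q$ is the exact posterior and the bound starts out tight.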

After training a stack of RBMs, the bottom-up recognition weights of the resulting DBN can be used to initialize the weights of a multi-layer feed-forward neural network, which can then be discriminatively fine-tuned by backpropagating error derivatives. The feed-forward network is given a final "softmax" layer that computes a probability distribution over class labels, and the derivative of the log probability of the correct class is backpropagated to train the incoming weights of the final layer and to discriminatively fine-tune the weights in all lower layers.
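The fine-tuning step described above can be sketched in a few lines of plain Python. The "pretrained" weights, the softmax-layer values, the learning rate, and the single training example are all made up for illustration. In a real setting the first two layers would be copied from the trained RBM stack. The function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(x, W, bias):
    """z_j = bias_j + sum_i x_i W[i][j]."""
    return [bias[j] + sum(x[i] * W[i][j] for i in range(len(x)))
            for j in range(len(bias))]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(t - m) for t in z]
    s = sum(e)
    return [t / s for t in e]

# Hypothetical recognition weights and hidden biases taken from two stacked RBMs.
layers = [
    {"W": [[0.5, -0.3], [0.2, 0.8], [-0.7, 0.1]], "bias": [0.1, -0.2]},  # RBM_1
    {"W": [[0.4, -0.6], [0.3, 0.9]], "bias": [0.0, 0.05]},               # RBM_2
]
# The softmax layer has no pretrained counterpart; small arbitrary values here.
softmax_layer = {"W": [[0.01, -0.02, 0.03], [0.02, 0.01, -0.01]],
                 "bias": [0.0, 0.0, 0.0]}

def forward(v):
    """Return the activations of every layer and the class probabilities."""
    activations = [list(v)]
    for layer in layers:
        z = affine(activations[-1], layer["W"], layer["bias"])
        activations.append([sigmoid(t) for t in z])
    logits = affine(activations[-1], softmax_layer["W"], softmax_layer["bias"])
    return activations, softmax(logits)

def fine_tune_step(v, label, lr=0.1):
    """One gradient-ascent step on log p(label | v) through all layers.

    Returns log p(label | v) as it was before the update.
    """
    activations, probs = forward(v)
    # d log p(label | v) / d logit_k = 1[k == label] - p_k
    delta = [(1.0 if k == label else 0.0) - p for k, p in enumerate(probs)]
    params = layers + [softmax_layer]
    for depth in range(len(params) - 1, -1, -1):
        layer, x = params[depth], activations[depth]
        # Error signal for the layer below, computed before W is updated.
        below = [sum(layer["W"][i][j] * delta[j] for j in range(len(delta)))
                 for i in range(len(x))]
        for j, d in enumerate(delta):
            layer["bias"][j] += lr * d
            for i in range(len(x)):
                layer["W"][i][j] += lr * x[i] * d
        if depth > 0:
            # Chain through the sigmoid of the lower layer: s'(z) = s * (1 - s).
            delta = [g * xi * (1.0 - xi) for g, xi in zip(below, x)]
    return math.log(probs[label])

v, label = [1, 0, 1], 2  # made-up input vector and class label
log_p_before = fine_tune_step(v, label)
log_p_after = math.log(forward(v)[1][label])
print(log_p_before, log_p_after)
```

The same loop trains the softmax layer from scratch and nudges the pretrained layers below it, which is the sense in which the lower weights are "fine-tuned" rather than learned from a random start.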

In principle, adding more layers improves modeling power, unless the DBN already perfectly models the data. In practice, however, little is gained by using more than about 3 hidden layers.