# Boltzmann Machines 2

## The Hopfield net as a distribution

### The Helmholtz Free Energy of a System

• At any time, the probability of finding the system in state $s$ at temperature $T$ is $P_T(s)$

• At each state it has a potential energy $E_s$

• The internal energy of the system, representing its capacity to do work, is the average energy over states

• $$U_{T}=\sum_{s} P_{T}(s) E_{s}$$
• The capacity to do work is counteracted by the internal disorder of the system, i.e. its entropy

• $$H_{T}=-\sum_{s} P_{T}(s) \log P_{T}(s)$$
• The Helmholtz free energy of the system measures the useful work derivable from it and combines the two terms: the internal energy minus the temperature-weighted entropy

• $$F_{T}=U_{T}-k T H_{T}$$
• $$=\sum_{s} P_{T}(s) E_{s}+k T \sum_{s} P_{T}(s) \log P_{T}(s)$$
• Minimizing the free energy w.r.t. $P_T(s)$, subject to $\sum_{s} P_{T}(s)=1$, gives the probability distribution of the states at steady state, known as the Boltzmann distribution

• $$P_{T}(s)=\frac{1}{Z} \exp \left(\frac{-E_{s}}{k T}\right)$$

• $Z=\sum_{s} \exp \left(\frac{-E_{s}}{k T}\right)$ is a normalizing constant
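The minimization above can be checked numerically. The sketch below is only an illustration: it takes the free energy as internal energy minus $kT$ times entropy, and the four state energies and the value of `kT` are made-up numbers, not values from these notes. It computes the Boltzmann distribution and compares its free energy with that of a uniform distribution over the same states.

```python
import math

def boltzmann(energies, kT=1.0):
    """Return P(s) = exp(-E_s / kT) / Z for a list of state energies."""
    # Subtract the minimum energy before exponentiating for numerical stability.
    e_min = min(energies)
    weights = [math.exp(-(e - e_min) / kT) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]

def free_energy(probs, energies, kT=1.0):
    """F = sum_s P(s) E_s + kT * sum_s P(s) log P(s), i.e. U - kT * entropy."""
    internal = sum(p * e for p, e in zip(probs, energies))
    neg_entropy = sum(p * math.log(p) for p in probs if p > 0)
    return internal + kT * neg_entropy

energies = [0.0, 1.0, 2.0, 5.0]  # hypothetical energies of four states
kT = 1.5                         # hypothetical temperature scale

p_star = boltzmann(energies, kT)
uniform = [1.0 / len(energies)] * len(energies)

print("F at the Boltzmann distribution:", free_energy(p_star, energies, kT))
print("F at the uniform distribution:  ", free_energy(uniform, energies, kT))
```

At the minimizer the free energy equals $-kT \log Z$, which gives a quick sanity check on any implementation of this formula.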

### Hopfield net as a distribution

• $E(S)=-\sum_{i<j} w_{i j} s_{i} s_{j}-\sum_{i} b_{i} s_{i}$
• $P(S)=\frac{\exp (-E(S))}{\sum_{S^{\prime}} \exp \left(-E\left(S^{\prime}\right)\right)}$
• The stochastic Hopfield network models a probability distribution over states
• It is a generative model: generates states according to $P(S)$
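For a network small enough to enumerate, $P(S)$ can be computed directly from the energy. The following is a minimal sketch under stated assumptions: the 3-node weights `w` and biases `b` are hypothetical, states take values in $\{-1,+1\}$, and the temperature is taken as 1. Enumeration costs $2^n$ evaluations, so this is only practical for very small networks.

```python
import itertools
import math
import random

def energy(state, w, b):
    """E(S) = -sum_{i<j} w[i][j] s_i s_j - sum_i b[i] s_i."""
    n = len(state)
    pair = sum(w[i][j] * state[i] * state[j]
               for i in range(n) for j in range(i + 1, n))
    bias = sum(b[i] * state[i] for i in range(n))
    return -pair - bias

def state_distribution(w, b):
    """Enumerate all 2^n states with s_i in {-1, +1}; return {state: P(state)}."""
    states = list(itertools.product([-1, 1], repeat=len(b)))
    unnorm = [math.exp(-energy(s, w, b)) for s in states]
    z = sum(unnorm)  # the partition function
    return {s: u / z for s, u in zip(states, unnorm)}

# Hypothetical 3-node network: symmetric weights, zero diagonal.
w = [[0.0, 1.0, -0.5],
     [1.0, 0.0, 0.8],
     [-0.5, 0.8, 0.0]]
b = [0.1, -0.2, 0.0]

dist = state_distribution(w, b)

# Generate states according to P(S), using a seeded generator for repeatability.
rng = random.Random(0)
samples = rng.choices(list(dist), weights=list(dist.values()), k=5)
print("most probable state:", max(dist, key=dist.get))
print("samples:", samples)
```

The most probable state is the one with the lowest energy, which is the state a deterministic Hopfield net would settle into.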

### The field at a single node

• Take a single node $i$ as an example

• Let $S$ and $S^\prime$ be two states that agree on every node except $i$, with $s_i=+1$ in $S$ and $s_i=-1$ in $S^\prime$

• $P(S)=P\left(s_{i}=1 \mid s_{j \neq i}\right) P\left(s_{j \neq i}\right)$
• $P\left(S^{\prime}\right)=P\left(s_{i}=-1 \mid s_{j \neq i}\right) P\left(s_{j \neq i}\right)$
• $\log P(S)-\log P\left(S^{\prime}\right)=\log P\left(s_{i}=1 \mid s_{j \neq i}\right)-\log P\left(s_{i}=-1 \mid s_{j \neq i}\right)$
• $\log P(S)-\log P\left(S^{\prime}\right)=\log \frac{P\left(s_{i}=1 \mid s_{j \neq i}\right)}{1-P\left(s_{i}=1 \mid s_{j \neq i}\right)}$
• $\log P(S)=-E(S)+C$

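• Isolate the terms of the energy that involve node $i$: $E_{\text {not } i}$ collects all terms that do not involve node $i$, and the energy is scaled by $\frac{1}{2}$ so that the log-odds below equal the local field

• $E(S)=-\frac{1}{2}\left(E_{\text {not } i}+\sum_{j \neq i} w_{i j} s_{j}+b_{i}\right)$
• $E\left(S^{\prime}\right)=-\frac{1}{2}\left(E_{\text {not } i}-\sum_{j \neq i} w_{i j} s_{j}-b_{i}\right)$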
• $\log P(S)-\log P\left(S^{\prime}\right)=E\left(S^{\prime}\right)-E(S)=\sum_{j \neq i} w_{i j} s_{j}+b_{i}$

• $\log \left(\frac{P\left(s_{i}=1 \mid s_{j \neq i}\right)}{1-P\left(s_{i}=1 \mid s_{j \neq i}\right)}\right)=\sum_{j \neq i} w_{i j} s_{j}+b_{i}$

• $P\left(s_{i}=1 \mid s_{j \neq i}\right)=\frac{1}{1+e^{-\left(\sum_{j \neq i} w_{i j} s_{j}+b_{i}\right)}}$

• The probability of any node taking the value 1, given the values of the other nodes, is a logistic function of its local field

## Redefining the network

• Redefine a regular Hopfield net as a stochastic system
• Each neuron is now a stochastic unit with a binary state $s_i$, which can take value 0 or 1 with a probability that depends on the local field
• $z_{i}=\sum_{j \neq i} w_{i j} s_{j}+b_{i}$
• $P\left(s_{i}=1 \mid s_{j \neq i}\right)=\frac{1}{1+e^{-z_{i}}}$
• Note
• The Hopfield net is a probability distribution over binary sequences (Boltzmann distribution)
• The conditional distribution of individual bits in the sequence is a logistic
• The evolution of the Hopfield net can be made stochastic
• Instead of deterministically responding to the sign of the local field, each neuron responds probabilistically
• Patterns are recalled by running the stochastic network, typically with annealing (gradually lowering the temperature)
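The stochastic update described above can be sketched as a Gibbs sampler over 0/1 units. This is an illustrative sketch, not a reference implementation: the weights, biases, and annealing schedule are hypothetical, and the temperature is assumed to enter by dividing the local field. The code first checks by brute force that the conditional probability obtained from the joint distribution equals the logistic of the local field, then runs a few annealed sweeps.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def local_field(state, w, b, i):
    """z_i = sum_{j != i} w[i][j] s_j + b[i]."""
    return sum(w[i][j] * state[j] for j in range(len(state)) if j != i) + b[i]

def energy(state, w, b):
    """E(S) = -sum_{i<j} w[i][j] s_i s_j - sum_i b[i] s_i, with s_i in {0, 1}."""
    n = len(state)
    pair = sum(w[i][j] * state[i] * state[j]
               for i in range(n) for j in range(i + 1, n))
    return -pair - sum(b[i] * state[i] for i in range(n))

def gibbs_sweep(state, w, b, rng, temperature=1.0):
    """Update every unit once, in place: s_i = 1 with probability sigmoid(z_i / T)."""
    for i in range(len(state)):
        p_on = sigmoid(local_field(state, w, b, i) / temperature)
        state[i] = 1 if rng.random() < p_on else 0
    return state

# Hypothetical 3-node network: symmetric weights, zero diagonal.
w = [[0.0, 1.0, -0.5], [1.0, 0.0, 0.8], [-0.5, 0.8, 0.0]]
b = [0.1, -0.2, 0.0]

# Brute-force check for node 0 with the other nodes fixed at (1, 0).
rest = (1, 0)
p_on = math.exp(-energy((1,) + rest, w, b))
p_off = math.exp(-energy((0,) + rest, w, b))
conditional = p_on / (p_on + p_off)
logistic = sigmoid(local_field((1,) + rest, w, b, 0))
print(conditional, logistic)

# Annealed recall: start hot, then cool down.
rng = random.Random(0)
state = [rng.randint(0, 1) for _ in b]
for temperature in [4.0, 2.0, 1.0, 0.5]:  # hypothetical annealing schedule
    for _ in range(10):
        gibbs_sweep(state, w, b, rng, temperature)
print("state after annealing:", state)
```

At high temperature the units flip almost at random; as the temperature drops, the updates approach the deterministic sign rule of the original Hopfield net.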

### The Boltzmann Machine

• The entire model can be viewed as a generative model
• Has a probability of producing any binary vector $y$
• $E(\mathbf{y})=-\frac{1}{2} \mathbf{y}^{T} \mathbf{W} \mathbf{y}$
• $P(\mathbf{y})=C \exp \left(-\frac{E(\mathbf{y})}{T}\right)$
• Training a Hopfield net: Must learn weights to “remember” target states and “dislike” other states
• Must learn weights to assign a desired probability distribution to states
• Just maximize likelihood

### Maximum Likelihood Training

• $\log (P(S))=\left(\sum_{i<j} w_{i j} s_{i} s_{j}\right)-\log \left(\sum_{S^{\prime}} \exp \left(\sum_{i<j} w_{i j} s_{i}^{\prime} s_{j}^{\prime}\right)\right)$

• $\mathcal{L}=\frac{1}{N} \sum_{S \in \mathbf{S}} \log (P(S)) =\frac{1}{N} \sum_{S}\left(\sum_{i<j} w_{i j} s_{i} s_{j}\right)-\log \left(\sum_{S^{\prime}} \exp \left(\sum_{i<j} w_{i j} s_{i}^{\prime} s_{j}^{\prime}\right)\right)$

• Second term derivation

• $\frac{d \log \left(\sum_{S^{\prime}} \exp \left(\sum_{i<j} w_{i j} s_{i}^{\prime} s_{j}^{\prime}\right)\right)}{d w_{i j}}=\sum_{S^{\prime}} \frac{\exp \left(\sum_{i<j} w_{i j} s_{i}^{\prime} s_{j}^{\prime}\right)}{\sum_{S^{\prime \prime}} \exp \left(\sum_{i<j} w_{i j} s_{i}^{\prime \prime} s_{j}^{\prime \prime}\right)} s_{i}^{\prime} s_{j}^{\prime}$
• $\frac{d \log \left(\sum_{S^{\prime}} \exp \left(\sum_{i<j} w_{i j} s_{i}^{\prime} s_{j}^{\prime}\right)\right)}{d w_{i j}}=\sum_{S^{\prime}} P\left(S^{\prime}\right) s_{i}^{\prime} s_{j}^{\prime}$
• The second term is simply the expected value of $s_i s_j$ over all possible values of the state
• Enumerating every state is infeasible for all but tiny networks, but the expectation can be estimated by sampling
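• Overall gradient ascent rule

• $\frac{d\langle\log (P(\mathbf{S}))\rangle}{d w_{i j}}=\frac{1}{N} \sum_{S \in \mathbf{S}} s_{i} s_{j}-\sum_{S^{\prime}} P\left(S^{\prime}\right) s_{i}^{\prime} s_{j}^{\prime}$
• $w_{i j}=w_{i j}+\eta \frac{d\langle\log (P(\mathbf{S}))\rangle}{d w_{i j}}$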
• Overall Training

• Initialize weights
• Let the network run to obtain simulated state samples
• Compute gradient and update weights
• Iterate
• Note the similarity to the update rule for the Hopfield network

• The only difference is how we got the samples
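The training procedure above can be sketched for a network with no hidden units. Two assumptions are worth flagging: the model expectation is computed here by exact enumeration, which is only feasible for tiny networks and stands in for the sampling estimate described above, and the training patterns, learning rate, and step count are hypothetical. States take values in $\{-1,+1\}$ and biases are omitted, as in the log-likelihood above.

```python
import itertools
import math

def pair_score(state, w):
    """sum_{i<j} w[i][j] s_i s_j, i.e. -E(S) for a bias-free network."""
    n = len(state)
    return sum(w[i][j] * state[i] * state[j]
               for i in range(n) for j in range(i + 1, n))

def model_distribution(w, n):
    """Return all states and their probabilities under the current weights."""
    states = list(itertools.product([-1, 1], repeat=n))
    unnorm = [math.exp(pair_score(s, w)) for s in states]
    z = sum(unnorm)
    return states, [u / z for u in unnorm]

def log_likelihood(data, w):
    """Average log P(S) over the training patterns."""
    states, probs = model_distribution(w, len(data[0]))
    lookup = dict(zip(states, probs))
    return sum(math.log(lookup[tuple(s)]) for s in data) / len(data)

def train(data, eta=0.1, steps=200):
    n = len(data[0])
    w = [[0.0] * n for _ in range(n)]
    for _ in range(steps):
        states, probs = model_distribution(w, n)  # fixed for this whole step
        for i in range(n):
            for j in range(i + 1, n):
                # First term: average of s_i s_j over the training data.
                data_term = sum(s[i] * s[j] for s in data) / len(data)
                # Second term: expected s_i s_j under the model.
                model_term = sum(p * s[i] * s[j] for s, p in zip(states, probs))
                w[i][j] += eta * (data_term - model_term)
                w[j][i] = w[i][j]  # keep the weights symmetric
    return w

data = [(1, 1, -1), (1, 1, 1), (-1, -1, 1), (1, 1, -1)]  # hypothetical patterns
w_trained = train(data)
print("log-likelihood before:", log_likelihood(data, [[0.0] * 3 for _ in range(3)]))
print("log-likelihood after: ", log_likelihood(data, w_trained))
```

The data term alone is a Hebbian update; the model term is what pushes probability away from states that are not in the training set.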

## Adding Capacity: Expanding the network

• Visible neurons: the neurons that store the actual patterns of interest
• Hidden neurons: the neurons that only serve to increase the capacity, but whose actual values are not important
• We could have multiple hidden patterns coupled with any visible pattern

• These would be multiple stored patterns that all give the same visible output
• We are interested in the marginal probabilities over visible bits

• $S=(V,H)$
• $P(S)=\frac{\exp (-E(S))}{\sum_{S^{\prime}} \exp \left(-E\left(S^{\prime}\right)\right)}$
• $P(S) = P(V,H)$
• $P(V)=\sum_{H} P(S)$
• Train to maximize probability of desired patterns of visible bits

• $E(S)=-\sum_{i<j} w_{i j} s_{i} s_{j}$
• $P(S)=\frac{\exp \left(\sum_{i<j} w_{i j} s_{i} s_{j}\right)}{\sum_{S^{\prime}} \exp \left(\sum_{i<j} w_{i j} s_{i}^{\prime} s_{j}^{\prime}\right)}$
• $P(V)=\sum_{H} \frac{\exp \left(\sum_{i<j} w_{i j} s_{i} s_{j}\right)}{\sum_{S^{\prime}} \exp \left(\sum_{i<j} w_{i j} s_{i}^{\prime} s_{j}^{\prime}\right)}$
• Maximum Likelihood Training

$$\log (P(V))=\log \left(\sum_{H} \exp \left(\sum_{i<j} w_{i j} s_{i} s_{j}\right)\right)-\log \left(\sum_{S^{\prime}} \exp \left(\sum_{i<j} w_{i j} s_{i}^{\prime} s_{j}^{\prime}\right)\right)$$

$$\mathcal{L}=\frac{1}{N} \sum_{V \in \mathbf{V}} \log (P(V))$$

$$\frac{d \mathcal{L}}{d w_{i j}}=\frac{1}{N} \sum_{V \in \mathbf{V}} \sum_{H} P(S \mid V) s_{i} s_{j}-\sum_{S^{\prime}} P\left(S^{\prime}\right) s_{i}^{\prime} s_{j}^{\prime}$$

• $\sum_{H} P(S \mid V) s_{i} s_{j} \approx \frac{1}{K} \sum_{H \in \mathbf{H}_{simul}} s_{i} s_{j}$

• Computed as the average of $s_i s_j$ over hidden states sampled with the visible bits fixed

• $\sum_{S^{\prime}} P\left(S^{\prime}\right) s_{i}^{\prime} s_{j}^{\prime} \approx \frac{1}{M} \sum_{S^{\prime} \in \mathbf{S}_{simul}} s_{i}^{\prime} s_{j}^{\prime}$

• Computed as the average of $s_i^{\prime} s_j^{\prime}$ over states sampled while the network is running “freely”
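The two expectations in the gradient can be verified on a network small enough to enumerate. The sketch below assumes 0/1 units, no biases, and a hypothetical 4-unit network with two visible and two hidden units. It computes both terms exactly and compares their difference with a finite-difference derivative of $\log P(V)$.

```python
import itertools
import math

def score(state, w):
    """sum_{i<j} w[i][j] s_i s_j, i.e. -E(S)."""
    n = len(state)
    return sum(w[i][j] * state[i] * state[j]
               for i in range(n) for j in range(i + 1, n))

def log_p_visible(v, w, n_hidden):
    """log P(V) = log sum_H exp(score(V, H)) - log sum_S' exp(score(S'))."""
    n = len(v) + n_hidden
    clamped = sum(math.exp(score(tuple(v) + h, w))
                  for h in itertools.product([0, 1], repeat=n_hidden))
    free = sum(math.exp(score(s, w))
               for s in itertools.product([0, 1], repeat=n))
    return math.log(clamped) - math.log(free)

def expected_product(w, i, j, n_hidden, v=None):
    """E[s_i s_j] under P(S | V=v) if v is given, otherwise under P(S)."""
    if v is None:
        states = list(itertools.product([0, 1], repeat=len(w)))
    else:
        states = [tuple(v) + h
                  for h in itertools.product([0, 1], repeat=n_hidden)]
    weights = [math.exp(score(s, w)) for s in states]
    return sum(wt * s[i] * s[j] for s, wt in zip(states, weights)) / sum(weights)

def numeric_gradient(v, w, i, j, n_hidden, eps=1e-5):
    """Central finite difference of log P(V) with respect to w[i][j], i < j."""
    w_plus = [row[:] for row in w]
    w_minus = [row[:] for row in w]
    w_plus[i][j] += eps
    w_minus[i][j] -= eps
    return (log_p_visible(v, w_plus, n_hidden)
            - log_p_visible(v, w_minus, n_hidden)) / (2 * eps)

# Hypothetical network: units 0 and 1 are visible, units 2 and 3 are hidden.
w = [[0.0, 0.5, -0.3, 0.8],
     [0.5, 0.0, 0.2, -0.6],
     [-0.3, 0.2, 0.0, 0.4],
     [0.8, -0.6, 0.4, 0.0]]
v, n_hidden, i, j = (1, 0), 2, 0, 2

analytic = expected_product(w, i, j, n_hidden, v) - expected_product(w, i, j, n_hidden)
numeric = numeric_gradient(v, w, i, j, n_hidden)
print(analytic, numeric)
```

In a real network neither expectation can be enumerated, which is why both are replaced by averages over sampled states.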

### Training

Step 1

• For each training pattern $V_i$
• Fix the visible units to $V_i$
• Let the hidden neurons evolve from a random initial point to generate $H_i$
• Generate $S_i = [V_i,H_i]$
• Repeat K times to generate synthetic training data

$$\mathbf{S}=\left\{S_{1,1}, S_{1,2}, \ldots, S_{1, K}, S_{2,1}, \ldots, S_{N, K}\right\}$$

Step 2

• Now unclamp the visible units and let the entire network evolve several times to generate

$$\mathbf{S}_{simul}=\left\{S_{simul, 1}, S_{simul, 2}, \ldots, S_{simul, M}\right\}$$

Gradients

$$\frac{d\langle\log (P(\mathbf{S}))\rangle}{d w_{i j}}=\frac{1}{N K} \sum_{S \in \mathbf{S}} s_{i} s_{j}-\frac{1}{M} \sum_{S^{\prime} \in \mathbf{S}_{simul}} s_{i}^{\prime} s_{j}^{\prime}$$

$$w_{i j}=w_{i j}+\eta \frac{d\langle\log (P(\mathbf{S}))\rangle}{d w_{i j}}$$

• Gradients are computed as before, except that the first term is now computed over the expanded training data
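The two steps and the gradient update can be sketched as follows. This is a rough illustration under several assumptions: units are 0/1, biases are omitted, each sample comes from a short Gibbs run started at a random point, and the function names, the number of sweeps, `K`, `M`, `eta`, and the training patterns are all hypothetical choices.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gibbs(state, w, free_units, rng, sweeps=20):
    """Resample only the units in free_units; all other units stay clamped."""
    for _ in range(sweeps):
        for i in free_units:
            z = sum(w[i][j] * state[j] for j in range(len(state)) if j != i)
            state[i] = 1 if rng.random() < sigmoid(z) else 0
    return state

def mean_products(samples, n):
    """Average of s_i s_j over a list of sampled states."""
    return [[sum(s[i] * s[j] for s in samples) / len(samples)
             for j in range(n)] for i in range(n)]

def train_step(w, patterns, n_hidden, rng, K=5, M=20, eta=0.05):
    n_visible = len(patterns[0])
    n = n_visible + n_hidden
    hidden = range(n_visible, n)
    # Step 1: clamp the visible units and sample the hidden units K times each.
    clamped = []
    for v in patterns:
        for _ in range(K):
            state = list(v) + [rng.randint(0, 1) for _ in hidden]
            clamped.append(gibbs(state, w, hidden, rng))
    # Step 2: unclamp everything and collect M free-running samples.
    free = []
    for _ in range(M):
        state = [rng.randint(0, 1) for _ in range(n)]
        free.append(gibbs(state, w, range(n), rng))
    # Gradient ascent: clamped average minus free-running average.
    pos, neg = mean_products(clamped, n), mean_products(free, n)
    for i in range(n):
        for j in range(i + 1, n):
            w[i][j] += eta * (pos[i][j] - neg[i][j])
            w[j][i] = w[i][j]  # keep the weights symmetric
    return w

rng = random.Random(0)
patterns = [(1, 1, 0), (0, 0, 1)]  # hypothetical visible patterns
n_hidden = 2
n = len(patterns[0]) + n_hidden
w = [[0.0] * n for _ in range(n)]
for _ in range(10):
    train_step(w, patterns, n_hidden, rng)
print(w)
```

Every weight update needs two full rounds of Gibbs sampling, which is where most of the training cost comes from.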

Issues

• Training takes forever
• Doesn’t really work for large problems
• Practical only for a small number of training instances over a small number of bits

## Restricted Boltzmann Machines

• Partition visible and hidden units
• Visible units ONLY talk to hidden units
• Hidden units ONLY talk to visible units

### Training

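Step 1

• For each sample
• Anchor visible units
• Sample from hidden units
• No looping: given the visible units, the hidden units do not interact with each other, so they can all be sampled in a single pass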

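Step 2

• Now unclamp the visible units and let the entire network evolve several times to generate

$$\mathbf{S}_{simul}=\left\{S_{simul, 1}, S_{simul, 2}, \ldots, S_{simul, M}\right\}$$

Sampling

• For each sample
• Initialize $V_0$ (visible) to training instance value
• Iteratively generate hidden and visible units

Gradient

$$\frac{\partial \log p(v)}{\partial w_{i j}}=\left\langle v_{i} h_{j}\right\rangle^{0}-\left\langle v_{i} h_{j}\right\rangle^{\infty}$$

• $\left\langle v_{i} h_{j}\right\rangle^{0}$ is the average with the visible units set to the training data, and $\left\langle v_{i} h_{j}\right\rangle^{\infty}$ is the average after the alternating sampling has run to equilibrium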

### A Shortcut: Contrastive Divergence

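• Recall: it is enough to raise the energy in the neighborhood of each target memory
• It is sufficient to run one iteration of the alternating sampling to get a good estimate of the gradient

$$\frac{\partial \log p(v)}{\partial w_{i j}}=\left\langle v_{i} h_{j}\right\rangle^{0}-\left\langle v_{i} h_{j}\right\rangle^{1}$$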
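A single contrastive divergence step can be sketched as below. The assumptions are not from these notes and are only for illustration: units are 0/1, biases are omitted, `w[i][j]` connects visible unit `i` to hidden unit `j`, sampled binary hidden states are used in both terms, and the batch, learning rate, and number of epochs are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sample_hidden(v, w, rng):
    """Each h_j is sampled independently given v, so one pass is enough."""
    probs = [sigmoid(sum(v[i] * w[i][j] for i in range(len(v))))
             for j in range(len(w[0]))]
    return [1 if rng.random() < p else 0 for p in probs]

def sample_visible(h, w, rng):
    """Each v_i is sampled independently given h."""
    probs = [sigmoid(sum(w[i][j] * h[j] for j in range(len(h))))
             for i in range(len(w))]
    return [1 if rng.random() < p else 0 for p in probs]

def cd1_update(w, batch, rng, eta=0.1):
    """One gradient step using <v_i h_j>^0 - <v_i h_j>^1, averaged over the batch."""
    n_v, n_h = len(w), len(w[0])
    grad = [[0.0] * n_h for _ in range(n_v)]
    for v0 in batch:
        h0 = sample_hidden(v0, w, rng)   # hidden sample with visible units at the data
        v1 = sample_visible(h0, w, rng)  # one reconstruction of the visible units
        h1 = sample_hidden(v1, w, rng)   # hidden sample after one iteration
        for i in range(n_v):
            for j in range(n_h):
                grad[i][j] += v0[i] * h0[j] - v1[i] * h1[j]
    for i in range(n_v):
        for j in range(n_h):
            w[i][j] += eta * grad[i][j] / len(batch)
    return w

rng = random.Random(0)
batch = [[1, 1, 0, 0], [0, 0, 1, 1]]   # hypothetical training vectors
w = [[0.0] * 2 for _ in range(4)]      # 4 visible units, 2 hidden units
epochs = 20
for _ in range(epochs):
    cd1_update(w, batch, rng)
print(w)
```

Compared with running the chain to equilibrium, this needs only three sampling passes per training vector, at the cost of a biased gradient estimate.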
