This lecture introduced the EM algorithm: an iterative technique for estimating probability models in the presence of missing or hidden data. Gaussian mixture models, PCA, and factor analysis can all be viewed as generative models whose parameters are estimated via EM.
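To make the iteration concrete, here is a minimal numpy sketch of EM for a two-component 1-D Gaussian mixture; the initialization from data quantiles and the toy data are my own illustrative choices, not from the lecture.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture (toy sketch)."""
    # Initialize: means from data quantiles, equal weights, data variance.
    mu = np.array([np.percentile(x, 25), np.percentile(x, 75)])
    var = np.array([x.var(), x.var()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point.
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
               / np.sqrt(2 * np.pi * var)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the soft assignments.
        nk = gamma.sum(axis=0)
        pi = nk / len(x)
        mu = (gamma * x[:, None]).sum(axis=0) / nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)])
pi, mu, var = em_gmm_1d(x)
```

The E-step is where the "missing data" view appears: the unobserved component assignments are replaced by their posterior expectations before the maximization step.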

This lecture redefined the regular Hopfield net as a stochastic system: the Boltzmann machine. It discussed the training and sampling issues of the Boltzmann machine model and introduced the Restricted Boltzmann Machine (RBM), which is commonly used in practice.
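As an illustration of how RBM training sidesteps the sampling difficulties of full Boltzmann machines, here is a minimal numpy sketch of one contrastive-divergence (CD-1) update for a binary RBM; the layer sizes, learning rate, and the tiny repeated-pattern dataset are all illustrative assumptions, not from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(v0, W, b, c, rng, lr=0.1):
    """One CD-1 update for a binary RBM (illustrative sketch).
    W couples visible and hidden units; b, c are visible/hidden biases."""
    # Up-pass: sample hidden units given the data.
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Down-pass: reconstruct visibles, then recompute hidden probabilities.
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)
    # Gradient approximation: positive-phase minus negative-phase statistics.
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / len(v0)
    b += lr * (v0 - v1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    return W, b, c

rng = np.random.default_rng(0)
data = np.tile([1.0, 1.0, 0.0, 0.0, 1.0, 1.0], (20, 1))
W = 0.01 * rng.standard_normal((6, 3))
b, c = np.zeros(6), np.zeros(3)
for _ in range(300):
    W, b, c = cd1_step(data, W, b, c, rng)
recon = sigmoid(sigmoid(data @ W + c) @ W.T + b)  # mean-field reconstruction
```

The key point is that CD-1 replaces the intractable model expectation with statistics from a single Gibbs step started at the data, which is why the bipartite RBM structure (no within-layer connections) is practical to train.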

Training Hopfield nets: geometric approach. The behavior of $\mathbf{E}(\mathbf{y})=\mathbf{y}^{T} \mathbf{W} \mathbf{y}$ with $\mathbf{W}=\mathbf{Y} \mathbf{Y}^{T}-N_{p} \mathbf{I}$ is identical to the behavior with $\mathbf{W}=\mathbf{Y} \mathbf{Y}^{T}$: the energy landscape only differs by an additive constant, so the gradients and the locations of the minima remain the same (the two matrices have the same eigenvectors). Since $\mathbf{y}^{T}\mathbf{y}=N$ for $\pm 1$ states,
$$\mathbf{y}^{T}\left(\mathbf{Y} \mathbf{Y}^{T}-N_{p} \mathbf{I}\right) \mathbf{y}=\mathbf{y}^{T} \mathbf{Y} \mathbf{Y}^{T} \mathbf{y}-N N_{p}$$
so we use $\mathbf{y}^{T} \mathbf{Y} \mathbf{Y}^{T} \mathbf{y}$ for the analysis.
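The additive-constant claim is easy to verify numerically: for random $\pm 1$ states, the two energies always differ by exactly $N N_p$. The sizes below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, Np = 8, 3                                  # 8 neurons, 3 stored patterns
Y = rng.choice([-1.0, 1.0], size=(N, Np))     # patterns as columns of Y
W_full = Y @ Y.T
W_shifted = W_full - Np * np.eye(N)           # subtracting Np*I zeroes the diagonal

def energy(W, y):
    return float(y @ W @ y)

# For every +/-1 state y, the energies differ by the constant N * Np,
# so gradients and the locations of minima are identical.
gaps = []
for _ in range(5):
    y = rng.choice([-1.0, 1.0], size=N)
    gaps.append(energy(W_full, y) - energy(W_shifted, y))
```

Note that the diagonal of $\mathbf{Y}\mathbf{Y}^{T}$ is exactly $N_p$ (each entry is a sum of $N_p$ squared $\pm 1$ values), so the shifted matrix is the zero-self-connection weight matrix of the Hopfield net.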

Self-Supervised Representation Learning
Broadly speaking, all generative models can be considered self-supervised, but with different goals: generative models focus on creating diverse and realistic images, while self-supervised representation learning cares about producing good features that are generally helpful for many downstream tasks.
Image-based pretext tasks:
Distortion: Exemplar-CNN (Dosovitskiy et al., 2015)
Rotation of an entire image (Gidaris et al., 2018)
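The rotation pretext task needs no human labels: the "label" is which transformation was applied. A minimal numpy sketch of batch construction for a Gidaris-style four-way rotation task (the image sizes and helper name are my own illustrative choices):

```python
import numpy as np

def rotation_pretext_batch(images, rng):
    """Build a rotation-prediction pretext batch: each image is rotated by a
    random multiple of 90 degrees, and the classification target is which of
    the four rotations was applied."""
    labels = rng.integers(0, 4, size=len(images))          # 0, 90, 180, 270 deg
    rotated = np.stack([np.rot90(img, k) for img, k in zip(images, labels)])
    return rotated, labels

rng = np.random.default_rng(0)
imgs = rng.random((16, 32, 32))        # toy stand-in for a real image batch
x, y = rotation_pretext_batch(imgs, rng)
```

A network trained to predict `y` from `x` must learn object orientation cues, which is what makes the resulting features transferable.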

Hopfield Net. So far, the neural networks used for computation have all been feedforward structures; the Hopfield net is a loopy network. Each neuron is a perceptron with a +1/-1 output. Every neuron receives input from, and sends output to, every other neuron. At each time step, each neuron receives a “field” $\sum_{j \neq i} w_{j i} y_{j}+b_{i}$. If the sign of the field matches the neuron's own sign, it does not respond; if the sign of the field opposes its own sign, the neuron “flips” to match the field. Each flip changes the field at other neurons, which may then flip in turn.
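The flip-until-stable dynamics above can be sketched directly; the asynchronous random update order and one-pattern storage example below are illustrative assumptions.

```python
import numpy as np

def hopfield_run(W, b, y, max_sweeps=100, rng=None):
    """Asynchronous Hopfield dynamics: flip any neuron whose sign disagrees
    with its local field, until no neuron wants to flip."""
    if rng is None:
        rng = np.random.default_rng(0)
    y = y.copy()
    for _ in range(max_sweeps):
        changed = False
        for i in rng.permutation(len(y)):
            # field = sum_{j != i} w_ij y_j + b_i (exclude the self term)
            field = W[i] @ y - W[i, i] * y[i] + b[i]
            s = 1.0 if field >= 0 else -1.0
            if s != y[i]:
                y[i] = s
                changed = True
        if not changed:
            break        # fixed point: every neuron agrees with its field
    return y

# Store one pattern with Hebbian weights (zero diagonal), corrupt one bit,
# and let the dynamics restore it.
p = np.array([1.0, -1.0, 1.0, 1.0, -1.0, 1.0, -1.0, -1.0])
W = np.outer(p, p) - np.eye(len(p))
noisy = p.copy()
noisy[0] = -noisy[0]
recovered = hopfield_run(W, np.zeros(len(p)), noisy)
```

Each flip can only lower the energy $-\mathbf{y}^T\mathbf{W}\mathbf{y}$ (up to the bias term), which is why the loop always reaches a fixed point.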

Kullback-Leibler divergence (information theory). Quantifying information follows three intuitions: likely events should have low information content; less likely events should have higher information content; independent events should have additive information. For example, finding out that a tossed coin has come up heads twice should convey twice as much information as finding out that it has come up heads once.
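The self-information $-\log_2 p$ satisfies all three intuitions, and the KL divergence is the expected difference of such log-terms between two distributions. A small numpy sketch (the function names are my own):

```python
import numpy as np

def self_information(p):
    """Shannon self-information -log2(p), in bits."""
    return -np.log2(p)

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as probability arrays;
    terms with p = 0 contribute nothing by convention."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

# A fair-coin head carries 1 bit; two independent heads carry 2 bits,
# since the probabilities multiply and the logs add.
one_head = self_information(0.5)
two_heads = self_information(0.5 * 0.5)
```

Additivity falls directly out of the logarithm: $-\log_2(p_1 p_2) = -\log_2 p_1 - \log_2 p_2$.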

Logistic regression. This is the perceptron with a sigmoid activation. It actually computes the probability that the input belongs to class 1. Decision boundaries may be obtained by comparing the probability to a threshold; these boundaries will be lines (hyperplanes in higher dimensions), so the sigmoid perceptron is a linear classifier. Estimating the model: given training data $\left(X_{1}, y_{1}\right),\left(X_{2}, y_{2}\right), \ldots,\left(X_{N}, y_{N}\right)$, where the $X$ are vectors and the $y$ are binary (0/1) class values, the total probability of the data is $$ P\left(\left(X_{1}, y_{1}\right),\left(X_{2}, y_{2}\right), \ldots,\left(X_{N}, y_{N}\right)\right)=\prod_{i} P\left(X_{i}, y_{i}\right)=\prod_{i} P\left(y_{i} \mid X_{i}\right) P\left(X_{i}\right) $$
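Maximizing the data likelihood above in the model $P(y=1\mid X)=\sigma(w^T X + b)$ can be done by gradient ascent on the log-likelihood. A toy numpy sketch, with the learning rate, iteration count, and synthetic two-class data as illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=500):
    """Maximum-likelihood logistic regression via batch gradient ascent
    on the log-likelihood (no regularization)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)                # P(y = 1 | x) under current model
        w += lr * X.T @ (y - p) / len(y)      # gradient of mean log-likelihood
        b += lr * (y - p).mean()
    return w, b

# Two Gaussian blobs, one per class; threshold at 0.5 gives a linear boundary.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])
w, b = fit_logistic(X, y)
pred = (sigmoid(X @ w + b) > 0.5).astype(float)
```

The gradient has the familiar perceptron-like form $(y - p)X$: the sigmoid's derivative cancels against the cross-entropy, which is why this model trains so cleanly.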

Generating Language: Synthesis. Input: symbols as one-hot vectors, whose dimensionality is the size of the "vocabulary", projected down to lower-dimensional "embeddings". The hidden units are (one or more layers of) LSTM units. The output at each time is a probability distribution that ideally assigns peak probability to the next word in the sequence. Divergence: $$ \operatorname{Div}(\mathbf{Y}_{\text {target}}(1 \ldots T), \mathbf{Y}(1 \ldots T))=\sum_{t}\operatorname{Xent}(\mathbf{Y}_{\text {target}}(t), \mathbf{Y}(t))=-\sum_{t} \log Y(t, w_{t+1}) $$
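Because the targets are one-hot, the per-step cross-entropy collapses to the negative log-probability the model assigns to the observed next word, as in the formula above. A minimal numpy sketch with a made-up 4-word vocabulary:

```python
import numpy as np

def sequence_divergence(probs, next_words):
    """Div = -sum_t log Y(t, w_{t+1}): probs[t] is the model's output
    distribution at time t, next_words[t] the index of the observed
    next word (the one-hot target picks out a single log term)."""
    return float(-np.sum(np.log(probs[np.arange(len(next_words)), next_words])))

# Toy model outputs over 3 time steps, vocabulary of size 4.
probs = np.array([[0.70, 0.10, 0.10, 0.10],
                  [0.25, 0.25, 0.25, 0.25],
                  [0.10, 0.10, 0.10, 0.70]])
next_words = np.array([0, 2, 3])
d = sequence_divergence(probs, next_words)
```

Note the time shift in the indices: the distribution emitted at time $t$ is scored against the word observed at time $t+1$, which is exactly the $w_{t+1}$ in the formula.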