## Seq2seq and attention model

Generating language synthesis:

- Input: symbols as one-hot vectors; the dimensionality of the vector is the size of the "vocabulary"
- Projected down to lower-dimensional "embeddings"
- The hidden units are (one or more layers of) LSTM units
- Output at each time: a probability distribution that ideally assigns peak probability to the next word in the sequence
- Divergence: $$\operatorname{Div}(\mathbf{Y}_{\text{target}}(1 \ldots T), \mathbf{Y}(1 \ldots T))=\sum_{t}\operatorname{Xent}(\mathbf{Y}_{\text{target}}(t), \mathbf{Y}(t))=-\sum_{t} \log Y(t, w_{t+1})$$
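The divergence above is just a sum of per-step cross-entropies, picking out the log-probability the model assigns to the next word. A minimal sketch (the names `probs` and `target_ids` are illustrative, not from the notes):

```python
import numpy as np

def sequence_divergence(probs, target_ids):
    """probs: (T, V) array, each row a predicted distribution over the vocabulary.
    target_ids: length-T sequence; target_ids[t] is the next word w_{t+1}.
    Returns -sum_t log Y(t, w_{t+1})."""
    T = len(target_ids)
    return -sum(np.log(probs[t, target_ids[t]]) for t in range(T))

# Toy vocabulary of 3 words over 2 time steps.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
target_ids = [0, 1]
div = sequence_divergence(probs, target_ids)  # -log 0.7 - log 0.8
```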

## Connectionist Temporal Classification

Sequence to sequence:

- Sequence goes in, sequence comes out
- No notion of "time synchrony" between input and output
- May not even maintain the order of symbols (e.g., translating from one language to another)

With order synchrony:

- The input and output sequences happen in the same order, although they may be time-asynchronous
- E.g., speech recognition: the input speech corresponds to the phoneme sequence output

Question: how do we know when to output symbols? In fact, the network produces outputs at every time step. Which of these are the real outputs?
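One common CTC-style answer to "which outputs are real" is to take the most likely symbol at every time step, then merge repeated symbols and drop a special blank. A hedged sketch of that greedy collapse (symbol names are illustrative):

```python
BLANK = "-"  # assumed blank symbol, as in CTC

def collapse(frame_symbols):
    """Collapse a frame-synchronous symbol sequence into an
    order-synchronous output: merge runs, then remove blanks."""
    out = []
    prev = None
    for s in frame_symbols:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return out

# The network emits a symbol at every time step...
frames = ["-", "B", "B", "-", "A", "A", "A", "T", "-"]
# ...but only the collapsed sequence is the real output.
decoded = collapse(frames)  # ["B", "A", "T"]
```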

## Divergence of RNN

Variants on recurrent nets:

- Architectures: how to train recurrent networks of different architectures
- Synchrony: the target output may be time-synchronous with the input, or order-synchronous but not time-synchronous

One to one (no recurrence in the model):

- Exactly as many outputs as inputs
- One-to-one correspondence between desired output and actual output
- Common assumption: $$\nabla_{Y(t)} \operatorname{Div}\left(Y_{\text{target}}(1 \ldots T), Y(1 \ldots T)\right)=w_{t} \nabla_{Y(t)} \operatorname{Div}\left(Y_{\text{target}}(t), Y(t)\right)$$
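The common assumption says the total divergence decomposes into a weighted sum of per-step divergences, so the gradient with respect to $Y(t)$ is just $w_t$ times the per-step gradient. A small numerical check, using squared error as a stand-in for the divergence (all values illustrative):

```python
import numpy as np

def per_step_div(y_t, target_t):
    # Squared error stands in for a generic per-step divergence.
    return 0.5 * np.sum((y_t - target_t) ** 2)

def total_div(Y, Y_target, w):
    # Weighted sum of per-step divergences.
    return sum(w[t] * per_step_div(Y[t], Y_target[t]) for t in range(len(w)))

Y = np.array([[0.2, 0.8], [0.6, 0.4]])
Y_target = np.array([[0.0, 1.0], [1.0, 0.0]])
w = [1.0, 0.5]

# Gradient of the total divergence w.r.t. Y(0) is w_0 * (Y(0) - Y_target(0)).
grad_t0 = w[0] * (Y[0] - Y_target[0])
```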

## Stability analysis and LSTMs

Stability:

- Will this necessarily be "Bounded Input, Bounded Output" (BIBO)?
- Guaranteed if output and hidden activations are bounded, but will it saturate?

Analyzing the recursion:

- Sufficient to analyze the behavior of the hidden layer, since it carries the relevant information
- Assume a linear system: $$z_{k}=W_{h} h_{k-1}+W_{x} x_{k}, \quad h_{k}=z_{k}$$
- Sufficient to analyze the response to a single input at $t = 0$ (with zero input elsewhere)
- Simple scalar linear recursion: $h(t) = wh(t-1) + cx(t)$, so $h_0(t) = w^{t}cx(0)$. If $|w| > 1$ it will blow up.
- Simple vector linear recursion: $h(t) = Wh(t-1) + Cx(t)$, so $h_0(t) = W^{t}Cx(0)$. For any input, for large $t$ the length of the hidden vector will expand or contract according to the $t$-th power of the largest eigenvalue of the hidden-layer weight matrix. If $|\lambda_{\max}| > 1$ it will blow up; otherwise it will contract and shrink to 0 rapidly.

Non-linearities:

- Sigmoid: saturates in a limited number of steps, regardless of $w$, to a value dependent only on $w$ (and bias, if any); the rate of saturation depends on $w$
- Tanh: sensitive to $w$, but eventually saturates; "prefers" weights close to 1.
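The scalar case can be checked directly: with a single input $x(0)=1$, the response is $h_0(t) = w^{t}c$, which explodes for $|w| > 1$ and vanishes for $|w| < 1$. A minimal sketch (the particular values of $w$, $c$, and the number of steps are illustrative):

```python
def recurse(w, c, steps):
    """Response of h(t) = w*h(t-1) + c*x(t) to x(0) = 1, zero input after."""
    h = c * 1.0
    for _ in range(steps):
        h = w * h
    return h

grow = recurse(w=1.1, c=1.0, steps=100)    # ~ 1.1**100: blows up
shrink = recurse(w=0.9, c=1.0, steps=100)  # ~ 0.9**100: shrinks to ~0
```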

## Recurrent Networks

Modelling series:

- In many situations one must consider a series of inputs to produce an output; the outputs too may be a series

Finite response model:

- Can use a convolutional neural net applied to series data (sliding it along the series); also called a Time-Delay Neural Network
- Something that happens today only affects the output of the system for a finite number of days into the future
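In a finite response model the output at time $t$ depends only on the last $K$ inputs, i.e. a 1-D convolution over the series, so any event influences the output for exactly $K$ steps. A hedged sketch (the filter values are illustrative):

```python
import numpy as np

def finite_response(x, filt):
    """y(t) = sum_k filt[k] * x(t - k): output depends on the last K inputs."""
    K = len(filt)
    y = np.zeros(len(x))
    for t in range(len(x)):
        for k in range(K):
            if t - k >= 0:
                y[t] += filt[k] * x[t - k]
    return y

x = np.zeros(10)
x[2] = 1.0                        # an event "today" (t = 2)
filt = np.array([0.5, 0.3, 0.2])  # K = 3: influence lasts 3 steps
y = finite_response(x, filt)
# y is nonzero only at t = 2, 3, 4 -- the effect dies out after K steps.
```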

## Back propagation through a CNN

Convolution:

- Each position in $z$ is the result of a convolution over maps in the previous layer

Ways of shrinking the maps:

- Stride greater than 1
- Downsampling (not necessary); typically performed with strides > 1

Pooling:

- Max pooling; note: keep track of the location of the max (needed during backprop)
- Mean pooling

Learning the CNN:

- Training is as in the case of the regular MLP; the only difference is in the structure of the network
- Define a divergence between the desired output and the true output of the network in response to any input
- Network parameters are trained through variants of gradient descent; gradients are computed through backpropagation

Final flat layers:

- Backpropagation continues in the usual manner until the computation of the derivative of the divergence

Recall, in backpropagation:

- Step 1: compute $\frac{\partial Div}{\partial z^{n}}$ and $\frac{\partial Div}{\partial y^{n}}$
- Step 2: compute $\frac{\partial Div}{\partial w^{n}}$ from step 1

Convolutional layer, computing $\nabla_{Z(l)} Div$:

$$\frac{d Div}{d z(l, m, x, y)}=\frac{d Div}{d Y(l, m, x, y)} f^{\prime}(z(l, m, x, y))$$
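The note about tracking max locations can be made concrete: the forward pass records the argmax of each pooling window so the backward pass can route the gradient only to the position that produced the max. A minimal sketch with 2x2 windows and stride 2 (names and sizes are illustrative):

```python
import numpy as np

def maxpool_forward(x):
    """2x2 max pooling with stride 2; also returns max locations for backprop."""
    H, W = x.shape
    out = np.zeros((H // 2, W // 2))
    argmax = {}
    for i in range(H // 2):
        for j in range(W // 2):
            win = x[2*i:2*i+2, 2*j:2*j+2]
            r, c = np.unravel_index(np.argmax(win), win.shape)
            out[i, j] = win[r, c]
            argmax[(i, j)] = (2*i + r, 2*j + c)   # remembered location
    return out, argmax

def maxpool_backward(dout, argmax, shape):
    """Gradient flows only to the position that was the max."""
    dx = np.zeros(shape)
    for (i, j), (r, c) in argmax.items():
        dx[r, c] += dout[i, j]
    return dx

x = np.array([[1., 3.],
              [2., 0.]])
out, amax = maxpool_forward(x)                 # out = [[3.]], max at (0, 1)
dx = maxpool_backward(np.ones((1, 1)), amax, x.shape)
```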

## CNN architecture

Architecture:

- A convolutional neural network comprises "convolutional" and "downsampling" layers
- Convolutional layers comprise neurons that scan their input for patterns
- Downsampling layers perform max operations on groups of outputs from the convolutional layers; they operate on individual maps and reduce the number of parameters
- The two may occur in any sequence, but typically they alternate
- Followed by an MLP with one or more layers

A convolutional layer:

- Each activation map has two components: an affine map, obtained by convolution over maps in the previous layer (each affine map has an associated learnable filter), and an activation that operates on the output of the convolution

What is a convolution?

- Scanning an image with a "filter"; equivalent to scanning with an MLP
- Weights: the size of the filter $\times$ the number of maps in the previous layer
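"Scanning an image with a filter" means the same weights are applied at every position, producing the affine part of an activation map. A hedged sketch of that scan (sizes and values are illustrative):

```python
import numpy as np

def scan(image, filt):
    """Slide one K x K filter over the image; the same weights are reused
    at every position (this is the affine map of a convolutional layer)."""
    H, W = image.shape
    K = filt.shape[0]
    out = np.zeros((H - K + 1, W - K + 1))
    for i in range(H - K + 1):
        for j in range(W - K + 1):
            out[i, j] = np.sum(image[i:i+K, j:j+K] * filt)
    return out

image = np.arange(16.0).reshape(4, 4)
filt = np.ones((2, 2))     # one shared set of weights for every position
z = scan(image, filt)      # the affine map; an activation would be applied next
```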

## Optimizers

Momentum and Nesterov's method improve convergence by normalizing the mean (first moment) of the derivatives.

Considering the second moments (RMSprop / Adagrad / AdaDelta / Adam):

- Simple gradient and momentum methods still demonstrate oscillatory behavior in some directions
- They depend on "magic" step-size parameters (the learning rate)
- Need to dampen the step size in directions with high motion: use a second-order term (a running average) to smooth the variation
- Scale down updates with large mean squared derivatives; scale up updates with small mean squared derivatives

RMSprop:

- Notation: the squared derivative is $\partial_{w}^{2} D=\left(\partial_{w} D\right)^{2}$, and the mean squared derivative is $E\left[\partial_{w}^{2} D\right]$
- This is a variant on the basic mini-batch SGD algorithm
- Updates are per parameter: $$E\left[\partial_{w}^{2} D\right]_{k}=\gamma E\left[\partial_{w}^{2} D\right]_{k-1}+(1-\gamma)\left(\partial_{w}^{2} D\right)_{k}$$
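Putting the pieces together, RMSprop maintains the running mean of the squared derivative from the update rule above and divides each step by its square root, so directions with large mean squared derivatives get scaled down. A minimal sketch on a toy quadratic (the hyperparameter values are illustrative):

```python
import numpy as np

def rmsprop(grad_fn, w, lr=0.1, gamma=0.9, eps=1e-8, steps=200):
    ms = 0.0  # running E[(dD/dw)^2]
    for _ in range(steps):
        g = grad_fn(w)
        # E[d^2 D]_k = gamma * E[d^2 D]_{k-1} + (1 - gamma) * (d D)^2_k
        ms = gamma * ms + (1 - gamma) * g * g
        # Scale the update down where the mean squared derivative is large.
        w = w - lr * g / (np.sqrt(ms) + eps)
    return w

# Minimize D(w) = w^2, whose derivative is 2w; the minimum is at w = 0.
w_final = rmsprop(lambda w: 2 * w, w=5.0)
```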