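Generating Language

Synthesis. Input: symbols as one-hot vectors. The dimensionality of each vector is the size of the "vocabulary". The vectors are projected down to lower-dimensional "embeddings". The hidden units are (one or more layers of) LSTM units. Output at each time: a probability distribution that ideally assigns peak probability to the next word in the sequence.

Divergence: the sum over time of the cross-entropy between target and output. Because each target is one-hot, every cross-entropy term reduces to the negative log of the probability $Y(t, w_{t+1})$ that the output at time $t$ assigns to the actual next word $w_{t+1}$:

$$ \operatorname{Div}(\mathbf{Y}_{\text {target}}(1 \ldots T), \mathbf{Y}(1 \ldots T))=\sum_{t}\operatorname{Xent}(\mathbf{Y}_{\text {target}}(t), \mathbf{Y}(t))=-\sum_{t} \log Y(t, w_{t+1}) $$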
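As a minimal sketch of the divergence above, the following computes $-\sum_t \log Y(t, w_{t+1})$ for a toy sequence. The function name `sequence_divergence`, the three-symbol vocabulary, and the probabilities are made up for illustration; they are not from the notes.

```python
import math

def sequence_divergence(probs, next_words):
    """Sum over time of -log Y(t, w_{t+1}).

    probs[t] is the output distribution at time t (a list over the vocabulary).
    next_words[t] is the index of the word that actually follows at time t+1.
    """
    return -sum(math.log(y_t[w]) for y_t, w in zip(probs, next_words))

# Hypothetical outputs for a 3-symbol vocabulary over 2 time steps.
Y = [[0.7, 0.2, 0.1],
     [0.1, 0.1, 0.8]]
targets = [0, 2]          # indices of the true next words
div = sequence_divergence(Y, targets)
```

The divergence is zero only when every output puts probability 1 on the true next word, and it grows as that probability shrinks.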

Sequence to sequence

A sequence goes in and a sequence comes out, with no notion of "time synchrony" between input and output. The output may not even maintain the order of symbols (for example, when going from one language to another).

With order synchrony: the input and output sequences happen in the same order, although they may be time-asynchronous. E.g. speech recognition, where the input speech corresponds to the phoneme sequence output.

Question: how do we know when to output symbols? In fact, the network produces outputs at every time step. Which of these are the real outputs?

Variants on recurrent nets

Architectures: how to train recurrent networks of different architectures. Synchrony: either the target output is time-synchronous with the input, or the target output is order-synchronous but not time-synchronous.

One to one: no recurrence in the model. There are exactly as many outputs as inputs, with a one-to-one correspondence between desired output and actual output.

Common assumption: the sequence divergence decomposes into a weighted sum of per-time divergences, so the gradient with respect to the output at time $t$ involves only that time's term, scaled by its weight $w_t$:

$$ \nabla_{Y(t)} \operatorname{Div}\left(Y_{\text {target}}(1 \ldots T), Y(1 \ldots T)\right)=w_{t} \nabla_{Y(t)} \operatorname{Div}\left(Y_{\text {target}}(t), Y(t)\right) $$

Stability

Will this necessarily be "Bounded Input Bounded Output" (BIBO)? That is guaranteed if the output and hidden activations are bounded. But will it saturate?

Analyzing recursion: it is sufficient to analyze the behavior of the hidden layer, since it carries the relevant information. Assume a linear system:

$$ z_{k}=W_{h} h_{k-1}+W_{x} x_{k}, \quad h_{k}=z_{k} $$

It is sufficient to analyze the response to a single input at $t = 0$ (the input is zero at all other times).

Simple scalar linear recursion: $h(t) = wh(t-1) + cx(t)$, with response $h_0(t) = w^t c x(0)$. If $|w| > 1$ it will blow up.

Simple vector linear recursion: $h(t) = Wh(t-1) + Cx(t)$, with response $h_0(t) = W^t C x(0)$. For any input, for large $t$ the length of the hidden vector will expand or contract according to the $t$-th power of the largest-magnitude eigenvalue of the hidden-layer weight matrix. If $|\lambda_{max}| > 1$ it will blow up; if $|\lambda_{max}| < 1$ it will contract and shrink to 0 rapidly.

Non-linearities. Sigmoid: saturates in a limited number of steps, regardless of $w$, to a value dependent only on $w$ (and bias, if any); the rate of saturation depends on $w$. Tanh: sensitive to $w$, but eventually saturates; "prefers" weights close to 1.
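The scalar linear recursion can be checked numerically. The sketch below feeds a single input at $t = 0$ into $h(t) = wh(t-1) + cx(t)$ and records the response; the function name `scalar_response` and the particular values of $w$ are illustrative assumptions.

```python
def scalar_response(w, c, x0, steps):
    """Response of h(t) = w*h(t-1) + c*x(t) to a single input x(0) = x0."""
    h = 0.0
    out = []
    for t in range(steps):
        x = x0 if t == 0 else 0.0   # zero input after t = 0
        h = w * h + c * x
        out.append(h)               # equals w**t * c * x0
    return out

grow = scalar_response(1.1, 1.0, 1.0, 50)     # |w| > 1: blows up
shrink = scalar_response(0.9, 1.0, 1.0, 50)   # |w| < 1: decays toward 0
flip = scalar_response(-1.1, 1.0, 1.0, 50)    # negative w: sign alternates, magnitude grows
```

The same experiment with a matrix $W$ in place of $w$ would show the growth rate set by the largest-magnitude eigenvalue; the scalar case is used here only to keep the sketch short.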

Modelling Series

In many situations one must consider a series of inputs to produce an output. The outputs too may be a series.

Finite response model: one can use a convolutional neural net applied to series data (slid along the series), also called a Time-Delay neural network. Something that happens today only affects the output of the system for a finite number of days into the future.
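To illustrate the finite-response idea, here is a deliberately simplified sketch: a single linear unit slid over a series, so that each output depends only on the most recent inputs. A real time-delay network would use non-linear units and several layers; the function name `finite_response` and the weights are hypothetical.

```python
def finite_response(x, weights):
    """y(t) = sum_i weights[i] * x(t - i), treating inputs before time 0 as zero."""
    n = len(weights)
    y = []
    for t in range(len(x)):
        y.append(sum(weights[i] * x[t - i] for i in range(n) if t - i >= 0))
    return y

# A single event at t = 0 followed by silence.
impulse = [1.0] + [0.0] * 7
y = finite_response(impulse, [0.5, 0.3, 0.2])
```

With three weights, the event at $t = 0$ influences only the outputs at $t = 0, 1, 2$; every later output is exactly zero.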

Convolution

Each position in $z$ holds the result of a convolution over the maps of the previous layer.

Ways of shrinking the maps: a stride greater than 1, or downsampling (not necessary), which is typically performed with strides > 1. Pooling: max pooling (note: keep track of the location of the max, which is needed during backprop) or mean pooling.

Learning the CNN: training is as in the case of the regular MLP. The only difference is in the structure of the network. Define a divergence between the desired output and the true output of the network in response to any input. Network parameters are trained through variants of gradient descent. Gradients are computed through backpropagation.

Final flat layers: backpropagation continues in the usual manner until the computation of the derivative of the divergence.

Recall in backpropagation. Step 1: compute $\frac{\partial Div}{\partial z^{n}}$ and $\frac{\partial Div}{\partial y^{n}}$. Step 2: compute $\frac{\partial Div}{\partial w^{n}}$ from the results of step 1.

Convolutional layer: computing $\nabla_{Z(l)} Div$.

$$ \frac{d Div}{d z(l, m, x, y)}=\frac{d Div}{d Y(l, m, x, y)} f^{\prime}(z(l, m, x, y)) $$
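The note about tracking the location of the max can be made concrete. The sketch below is a one-dimensional simplification (the maps in the notes are two-dimensional): the forward pass stores the index of each winning input, and the backward pass routes each output derivative to that index only. The function names are illustrative.

```python
def maxpool1d_forward(z, width, stride):
    """Return the pooled values and the index in z of each window's max."""
    out, argmax = [], []
    for start in range(0, len(z) - width + 1, stride):
        window = z[start:start + width]
        k = max(range(width), key=lambda i: window[i])  # position of max in window
        out.append(window[k])
        argmax.append(start + k)                        # remember it for backprop
    return out, argmax

def maxpool1d_backward(dout, argmax, n):
    """Send each output derivative back to the input position that won the max."""
    dz = [0.0] * n
    for d, idx in zip(dout, argmax):
        dz[idx] += d
    return dz

z = [1.0, 3.0, 2.0, 5.0, 4.0, 0.0]
pooled, where = maxpool1d_forward(z, width=2, stride=2)
dz = maxpool1d_backward([10.0, 20.0, 30.0], where, len(z))
```

Inputs that did not win their window receive a zero derivative, which is why the locations must be saved during the forward pass.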

Architecture

A convolutional neural network comprises "convolutional" and "downsampling" layers. Convolutional layers comprise neurons that scan their input for patterns. Downsampling layers perform max operations on groups of outputs from the convolutional layers; they operate on each map individually and serve to reduce the number of parameters. The two may occur in any sequence, but typically they alternate. They are followed by an MLP with one or more layers.

A convolutional layer: each activation map has two components. The first is an affine map, obtained by convolution over the maps in the previous layer; each affine map has, associated with it, a learnable filter. The second is an activation that operates on the output of the convolution.

What is a convolution: scanning an image with a "filter", which is equivalent to scanning with an MLP. Weights: size of the filter $\times$ no.
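A minimal sketch of the two components of an activation map, for a single input map and a single filter: an affine map produced by scanning the image with the filter, followed by an activation. The function names, the toy image, and the choice of ReLU as the activation are assumptions made for illustration. As is common in CNN code, the filter is not flipped, so strictly speaking this is a cross-correlation.

```python
def conv2d_valid(image, filt, bias=0.0):
    """Scan `image` with `filt` (stride 1, no padding) and return the affine map."""
    H, W = len(image), len(image[0])
    K, L = len(filt), len(filt[0])
    out = []
    for i in range(H - K + 1):
        row = []
        for j in range(W - L + 1):
            s = bias
            for a in range(K):
                for b in range(L):
                    s += filt[a][b] * image[i + a][j + b]
            row.append(s)
        out.append(row)
    return out

def relu(z):
    """Elementwise activation applied to the affine map."""
    return [[max(0.0, v) for v in row] for row in z]

image = [[1, 2, 0],
         [0, 1, 3],
         [4, 0, 1]]
filt = [[1, -1],
        [0, 1]]
z = conv2d_valid(image, filt)   # affine map
y = relu(z)                     # activation map
```

The same four filter weights are reused at every position, which is what makes the scan a shared-parameter computation.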

Cascade-Correlation Algorithm

Start with direct I/O connections only, with no hidden units. Train the output-layer weights using BP or Quickprop. If the error is now acceptable, quit. Otherwise, create one new hidden unit offline: create a pool of candidate units, each of which gets all available inputs and whose outputs are not yet connected to anything. Train the incoming weights to maximize the match (covariance) between each unit's output and the residual error. When all are quiescent, tenure the winner and add it to the active net.
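The notes do not show the match criterion itself. As a hedged sketch, the score below follows the form usually given for cascade-correlation: for each output unit, take the covariance over training patterns between the candidate's output and the residual error, and sum the magnitudes. The function name `candidate_score` and the data are hypothetical.

```python
def candidate_score(v, errors):
    """Sum over outputs of |covariance between candidate output and residual error|.

    v[p] is the candidate unit's output on pattern p.
    errors[p][o] is the residual error at output unit o on pattern p.
    """
    n = len(v)
    v_mean = sum(v) / n
    score = 0.0
    for o in range(len(errors[0])):
        e_mean = sum(errors[p][o] for p in range(n)) / n
        cov = sum((v[p] - v_mean) * (errors[p][o] - e_mean) for p in range(n))
        score += abs(cov)   # sign does not matter: output weights can flip it
    return score

v = [0.0, 1.0, 0.0, 1.0]
tracks_error = candidate_score(v, [[0.0], [2.0], [0.0], [2.0]])
unrelated = candidate_score(v, [[1.0], [1.0], [-1.0], [-1.0]])
```

A candidate whose output rises and falls with the residual error scores high, so adding it gives the output layer something it can use to cancel that error.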

Motivation

Find a word in a signal, or find an item in a picture. The need for shift invariance: the location of a pattern is not important, so we can scan for the pattern with the same MLP at every position. The whole scan is just one giant network, with the restriction that all subnets are identical.

Regular networks vs. scanning networks: in a regular MLP every neuron in a layer is connected by a unique weight to every unit in the previous layer. In a scanning MLP each neuron is connected to a subset of neurons in the previous layer. The weight matrix is sparse and block-structured with identical blocks. The network is a shared-parameter model.

Modifications. Order changed: intuitively, we scan at one position and get its output, then scan the next position. But we can also first scan all the positions at one layer, then move to the next layer. The result is the same.

Distributing the scan: evaluate small patterns in the first layer. The higher layer implicitly learns the arrangement of sub-patterns that represents the larger pattern. Why distribute?
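The claim that the two scan orders give the same result can be checked on a toy one-dimensional example. In this sketch, which uses made-up weights and function names, a first-layer unit looks at two adjacent inputs and a second-layer unit looks at two adjacent first-layer outputs.

```python
import math

def first_layer(x, p, w1):
    """Small-pattern detector looking at inputs p and p+1."""
    return math.tanh(w1[0] * x[p] + w1[1] * x[p + 1])

def second_layer(h0, h1, w2):
    """Combines two adjacent first-layer outputs into a larger pattern."""
    return math.tanh(w2[0] * h0 + w2[1] * h1)

def scan_by_position(x, w1, w2):
    """Evaluate the whole two-layer subnet at one position, then move on."""
    out = []
    for p in range(len(x) - 2):
        h0 = first_layer(x, p, w1)
        h1 = first_layer(x, p + 1, w1)
        out.append(second_layer(h0, h1, w2))
    return out

def scan_by_layer(x, w1, w2):
    """Evaluate the first layer at every position, then the second layer."""
    h = [first_layer(x, p, w1) for p in range(len(x) - 1)]
    return [second_layer(h[p], h[p + 1], w2) for p in range(len(h) - 1)]

x = [0.5, -1.0, 2.0, 0.0, 1.5]
w1, w2 = [0.3, -0.7], [1.2, 0.4]
a = scan_by_position(x, w1, w2)
b = scan_by_layer(x, w1, w2)
```

Both orders produce identical outputs, but the layer-by-layer order computes each first-layer value once instead of recomputing it for every overlapping position.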

Optimizers

Momentum and Nesterov's method improve convergence by normalizing the mean (first moment) of the derivatives. Methods that also consider the second moments: RMS Prop / Adagrad / AdaDelta / ADAM.

Simple gradient and momentum methods still demonstrate oscillatory behavior in some directions. They depend on magic step size parameters (the learning rate). We need to dampen the step size in directions with high motion, using a second-order term (use the variation to smooth it): scale down updates with large mean squared derivatives, and scale up updates with small mean squared derivatives.

RMS Prop. Notation: the squared derivative is $\partial_{w}^{2} D=\left(\partial_{w} D\right)^{2}$, and the mean squared derivative is $E\left[\partial_{w}^{2} D\right]$. This is a variant on the basic mini-batch SGD algorithm. Updates are made per parameter, using a running average of the squared derivative:

$$ E\left[\partial_{w}^{2} D\right]_{k}=\gamma E\left[\partial_{w}^{2} D\right]_{k-1}+(1-\gamma)\left(\partial_{w}^{2} D\right)_{k} $$
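The running average above is only half of the method; the notes stop before the parameter update. The sketch below pairs it with the commonly used RMS Prop update, in which the step is the learning rate times the derivative divided by the root of the running mean square. The function name, the default values for `lr`, `gamma`, and `eps`, and the placement of `eps` inside the square root are assumptions; implementations differ on these details.

```python
import math

def rmsprop_step(w, grad, mean_sq, lr=0.01, gamma=0.9, eps=1e-8):
    """One per-parameter RMS Prop update; returns the new (w, mean_sq)."""
    # Running average of the squared derivative, as in the equation above.
    mean_sq = gamma * mean_sq + (1.0 - gamma) * grad * grad
    # Assumed update rule: normalize the step by the RMS of recent derivatives.
    w = w - lr * grad / math.sqrt(mean_sq + eps)
    return w, mean_sq

# Toy problem: minimize D(w) = w^2, whose derivative is 2w.
w, ms = 5.0, 0.0
for _ in range(2000):
    w, ms = rmsprop_step(w, 2.0 * w, ms)
```

Because the step is normalized, its size depends mostly on the learning rate and not on the raw magnitude of the derivative, which is the damping effect described above.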