#Stability
 Will this necessarily be「Bounded Input Bounded Output」?
 Guaranteed if output and hidden activations are bounded
 But will it saturate？
#Analyzing Recursion

Sufficient to analyze the behavior of the hidden layer since it carries the relevant information

Assumed linear systems
 $$ z_{k}=W_{h} h_{k1}+W_{x} x_{k}, \quad h_{k}=z_{k} $$
 Sufficient to analyze the response to a single input at $t =0$ (else is zero input)
#Simple scalar linear recursion
 $h(t) = wh(t1) + cx(t)$
 $h_0(t) = w^tcx(0)$
 If $w > 1$ it will blow up
#Simple Vector linear recursion
 $h(t) = Wh(t1) + Cx(t)$
 $h_0(t) = W^tCx(0)$
 For any input, for large the length of the hidden vector will expand or contract according to the $t$ th power of the largest eigen value of the hiddenlayer weight matrix
 If $\lambda_{max} > 1$ it will blow up, otherwise it will contract and shrink to 0 rapidly
#Nonlinearities
 Sigmoid: Saturates in a limited number of steps, regardless of $w$
 To a value dependent only on $w$ (and bias, if any)
 Rate of saturation depends on $w$
 Tanh: Sensitive to $w$, but eventually saturates
 “Prefers” weights close to 1.0
 Relu: Sensitive to $w$, can blow up
#Lessons
 Recurrent networks retain information from the infinite past in principle
 In practice, they tend to blow up or forget
 If the largest Eigen value of the recurrent weights matrix is greater than 1, the network response may blow up
 If it’s less than one, the response dies down very quickly
 The “memory” of the network also depends on the parameters (and activation) of the hidden units
 Sigmoid activations saturate and the network becomes unable to retain new information
 RELU activations blow up or vanish rapidly
 Tanh activations are the most effective at storing memory
 And still has very short “memory”
 Still sensitive to Eigenvalues of $W$
#Vanishing gradient
 A particular problem with training deep networks is the gradient of the error with respect to weights is unstable
 For
 $\operatorname{Div}(X)=D\left(f_{N}\left(W_{N1} f_{N1}\left(W_{N2} f_{N2}\left(\ldots W_{0} X\right)\right)\right)\right)$
 We get
 $\nabla_{f_{k}} \operatorname{Div}=\nabla D . \nabla f_{N} . W_{N1} . \nabla f_{N1} . W_{N2} \ldots \nabla f_{k+1} W_{k}$
 Where
 $\nabla{f_{n}}$ is *jacobian* of $f_N()$ to its current input
#For activation
 For RNN
 $\nabla f_{t}\left(z_{i}\right)=\left[\begin{array}{cccc}f_{t, 1}^{\prime}\left(z_{1}\right) & 0 & \cdots & 0 \\ 0 & f_{t, 2}^{\prime}\left(z_{2}\right) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & f_{t, N}^{\prime}\left(z_{N}\right)\end{array}\right]$
 For vector activations: A full matrix
 For scalar activations: A matrix where the diagonal entries are the derivatives of the activation of the recurrent hidden layer
 The derivative (or subgradient) of the activation function is always bounded
 Most common activation functions, such as sigmoid, tanh() and RELU have derivatives that are always less than 1
 Multiplication by the Jacobian is always a shrinking operation
 After a few layers the derivative of the divergence at any time is totally “forgotten”
#For weights
 In a singlelayer RNN, the weight matrices are identical
 The conclusion below holds for any deep network, though
 The chain product for $\nabla_{f_k} Div$ will
 Expand $\nabla D$ along directions in which the singular values of the weight matrices are greater than 1
 Shrink $\nabla D$ in directions where the singular values are less than 1
 Repeated multiplication by the weights matrix will result in Exploding or vanishing gradients
#LSTM
#Problem
 Recurrent nets are very deep nets
 Stuff gets forgotten in the forward pass too
 Each weights matrix and activation can shrink components of the input
 Need the longterm dependency
 The memory retention of the network depends on the behavior of the weights and jacobian
 Which in turn depends on the parameters $W$ rather than what it is trying to remember
 We need
 Not be directly dependent on vagaries of network parameters, but rather on inputbased determination of whether it must be remembered
 Retain memories until a switch based on the input flags them as ok to forget
 「Curly brace must remember until curly brace is closed」
 LSTM
 Address the problem of inputdependent memory behavior
#Architecture
 The $\sigma$ are multiplicative gates that decide if something is important or not
#Key component
Remembered cell state
 Mutiply is a switch
 Should I continue remember or not? (scale up / down)
 Acddition
 Should I agument the memory?
 $C_t$ is the linear history carried by the constanterror carousel
 Carries information through, only affected by a gate
 And addition of history, which too is gated..
Gates
 Gates are simple sigmoidal units with outputs in the range (0,1)
 Controls how much of the information is to be let through
#Forget gate
 The first gate determines whether to carry over the history or to forget it
 More precisely, how much of the history to carry over
 Also called the “forget” gate
 Note, we’re actually distinguishing between the cell memory $C$ and the state $h$ that is coming over time! They’re related though
 Hidden state is compute from memory (which is stored)
#Input gate
 The second input has two parts
 A perceptron layer that determines if there’s something new and interesting in the input
 「See a curly brace」
 A gate that decides if its worth remembering
 「Curly brace is in comment section, ignore it」
 A perceptron layer that determines if there’s something new and interesting in the input
#Memory cell update
 If something new and worth remembering
 Added to the current memory cell
#Output and Output gate
 The output of the cell
 Simply compress it with tanh to make it lie between 1 and 1
 Note that this compression no longer affects our ability to carry memory forward
 Controlled by an output gate
 To decide if the memory contents are worth reporting at this time
 Simply compress it with tanh to make it lie between 1 and 1
#The “Peephole” Connection
 The raw memory is informative by itself and can also be input
 Note, we’re using both $C$ and $h$
#Forward
#Backward^{}[1]
$$ \begin{array}{l} \nabla_{C_{t}} D i v=&\nabla_{h_{t}} D i v \circ\left(o_{t} \circ \tanh ^{\prime}(.)+\tanh (.) \circ \sigma^{\prime}(.) W_{C o}\right)+ \\\\ &\nabla_{C_{t+1}} D i v \circ\left(f_{t+1}+C_{t} \circ \sigma^{\prime}(.) W_{C f}+\tilde{C}_{t+1} \circ \sigma^{\prime}(.) W_{C i} \circ \tanh (.) \ldots\right) \end{array} $$ $$ \begin{aligned} \nabla_{h_{t}} D i v=& \nabla_{z_{t}} D i v \nabla_{h_{t}} z_{t}+\nabla_{C_{t+1}} D i v \circ\left(C_{t} \circ \sigma^{\prime}(.) W_{h f}+\tilde{C}_{t+1} \circ \sigma^{\prime}(.) W_{h i}\right)+\\\\ &\nabla_{C_{t+1}} D i v \circ o_{t+1} \circ \tanh ^{\prime}(.) W_{h i}+\nabla_{h_{t+1}} D i v \circ \tanh (.) \circ \sigma^{\prime}(.) W_{h o} \end{aligned} $$ And weights?#Gated Recurrent Units
 Combine forget and input gates
 In new input is to be remembered, then this means old memory is to be forgotten
 No need to compute twice
 Don’t bother to separately maintain compressed and regular memories
 Redundant representation
#Summary
 LSTMs are an alternative formalism where memory is made more directly dependent on the input, rather than network parameters/structure
 Through a “Constant Error Carousel” memory structure with no weights or activations, but instead direct switching and “increment/decrement” from pattern recognizers
 Do not suffer from a vanishing gradient problem but do suffer from exploding gradient issue