#Variants on recurrent nets
 Architectures
 How to train recurrent networks of different architectures
 Synchrony
 The target output is time-synchronous with the input
 The target output is order-synchronous, but not time-synchronous
#One to one

No recurrence in model
 Exactly as many outputs as inputs
 One to one correspondence between desired output and actual output

Common assumption $$ \nabla_{Y(t)} \operatorname{Div}\left(Y_{\text {target}}(1 \ldots T), Y(1 \ldots T)\right)=w_{t} \nabla_{Y(t)} \operatorname{Div}\left(Y_{\text {target}}(t), Y(t)\right) $$
 $w_t$ is typically set to 1.0
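The assumed decomposition above can be sketched in numpy: the total divergence is a weighted sum of per-time divergences, with cross-entropy as the per-time term. The function name and the 1e-12 numerical floor are my own choices, not from the original.

```python
import numpy as np

def sequence_divergence(y_target, y, w=None):
    """Total divergence for a time-synchronous output sequence,
    assumed to decompose into a weighted sum of per-time terms.
    y_target, y: (T, K) arrays -- one-hot targets and output
    probabilities over K classes at T time steps.
    w: per-time weights w_t; defaults to 1.0 everywhere."""
    T = len(y)
    if w is None:
        w = np.ones(T)  # w_t is typically set to 1.0
    # per-time cross-entropy: -sum_k target_k * log(y_k)
    per_t = -np.sum(y_target * np.log(y + 1e-12), axis=1)
    return np.sum(w * per_t)
```

With this decomposition, the gradient of the total divergence with respect to $Y(t)$ depends only on the target and output at time $t$, scaled by $w_t$.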
#Many to many
 The divergence computed is between the sequence of outputs by the network and the desired sequence of outputs
 This is not just the sum of the divergences at individual times
#Language modelling: Representing words

Represent words as one-hot vectors
 Problem: vectors are as long as the vocabulary, and extremely sparse
 Makes no assumptions about the relative importance of words

Projected word vectors
 Replace every one-hot vector $W_i$ by $PW_i$
 $P$ is an $M\times N$ matrix with $M \ll N$, projecting the $N$-dimensional one-hot vector into a dense $M$-dimensional space
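Because $W_i$ is one-hot, multiplying by $P$ simply selects column $i$ of $P$, i.e. the projection is an embedding-table lookup. A minimal check, with hypothetical sizes $N=5$, $M=3$:

```python
import numpy as np

N, M = 5, 3                    # hypothetical vocabulary and projection sizes
rng = np.random.default_rng(0)
P = rng.normal(size=(M, N))    # projection matrix (learned in practice)

i = 2
w_i = np.zeros(N)
w_i[i] = 1.0                   # one-hot vector for word i

# P @ w_i selects column i of P: the word's M-dimensional vector
assert np.allclose(P @ w_i, P[:, i])
```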

How to learn projections
 Soft bag of words
 Predict word based on words in immediate context
 Without considering specific position
 Skip-grams
 Predict adjacent words based on current word
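The two objectives differ only in which direction the prediction runs. A small sketch of the training pairs each one generates from a token sequence (the function name and window size are my own assumptions):

```python
def training_pairs(words, window=2):
    """Generate (input, target) pairs from a token sequence:
    - soft bag of words (CBOW): predict a word from its context,
      without considering the specific positions of the context words
    - skip-gram: predict each context word from the current word"""
    cbow, skipgram = [], []
    for t, w in enumerate(words):
        context = [words[j]
                   for j in range(max(0, t - window),
                                  min(len(words), t + window + 1))
                   if j != t]
        cbow.append((context, w))                  # context -> word
        skipgram.extend((w, c) for c in context)   # word -> each context word
    return cbow, skipgram
```

In both cases the learned projection matrix $P$, not the predictor itself, is the quantity of interest.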
#Many to one
 Example
 Question answering
 Input: Sequence of words
 Output: Answer at the end of the question
 Speech recognition
 Input: Sequence of feature vectors (e.g. Mel spectra)
 Output: Phoneme ID at the end of the sequence

Outputs are actually produced for every input
 We only read the output at the final time step

How to train
 Define the divergence everywhere
 $\operatorname{DIV}\left(Y_{\text{target}}, Y\right)=\sum_{t} w_{t} \operatorname{Xent}(Y(t), \text{Phoneme}(t))$
 Typical weighting scheme for speech
 All are equally important
 Problem like question answering
 Answer only expected after the question ends
 Define the divergence only at the final time: $w_t = 0$ for $t < T$, $w_T = 1$
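The two weighting schemes can be written out explicitly. A small sketch (the function and the `"qa"` task label are my own naming):

```python
import numpy as np

def weights(T, task="speech"):
    """Per-time divergence weights w_t for the two cases above:
    - speech: every output is equally important -> w_t = 1 for all t
    - question answering: only the output after the question ends is
      scored -> w_t = 0 everywhere except the final step"""
    if task == "speech":
        return np.ones(T)
    w = np.zeros(T)
    w[-1] = 1.0
    return w
```

Either way, backpropagation through time is unchanged; the weights only determine which time steps contribute gradient.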
#Sequence-to-sequence
 How do we know when to output symbols?
 In fact, the network produces outputs at every time
 Which of these are the real outputs?
 Outputs that represent the definitive occurrence of a symbol
 Option 1: Simply select the most probable symbol at each time
 Merge adjacent repeated symbols, and place the actual emission of the symbol in the final instant
 Cannot distinguish between an extended symbol and repetitions of the symbol
 Resulting sequence may be meaningless
 Option 2: Impose external constraints on what sequences are allowed
 Only allow sequences corresponding to dictionary words
 Sub-symbol units
 How to train when no timing information provided
 Only the sequence of output symbols is provided for the training data
 But no indication of which one occurs where
 How do we compute the divergence?
 And how do we compute its gradient?
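Option 1 above (pick the most probable symbol at each time, then merge adjacent repeats) can be sketched as follows; the symbol labels in the usage note are hypothetical phoneme names:

```python
import numpy as np

def greedy_decode(probs, symbols):
    """Option 1: select the most probable symbol at each time step,
    then merge adjacent repeated symbols. probs: (T, K) output
    probabilities; symbols: the K symbol names.
    Note the stated limitation: this cannot distinguish a symbol
    extended over several steps from a genuine repetition."""
    best = np.argmax(probs, axis=1)
    out = []
    for idx in best:
        if not out or out[-1] != symbols[idx]:
            out.append(symbols[idx])
    return out
```

For example, per-step picks `B B IY IY F F` merge to `B IY F`, but a word that genuinely repeats a symbol in adjacent positions would be wrongly collapsed, which is why Option 2 imposes external constraints such as a dictionary.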