Divergence of RNN

#Variants on recurrent nets

  • Architectures
    • How to train recurrent networks of different architectures
  • Synchrony
    • The target output is time-synchronous with the input
    • The target output is order-synchronous, but not time-synchronous

#One to one

  • No recurrence in model

    • Exactly as many outputs as inputs
    • One to one correspondence between desired output and actual output
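  • Common assumption: the sequence divergence is a weighted sum of the divergences at individual times, $$ \operatorname{Div}\left(Y_{\text {target}}(1 \ldots T), Y(1 \ldots T)\right)=\sum_{t} w_{t} \operatorname{Div}\left(Y_{\text {target}}(t), Y(t)\right) $$ so the gradient at each time is $$ \nabla_{Y(t)} \operatorname{Div}\left(Y_{\text {target}}(1 \ldots T), Y(1 \ldots T)\right)=w_{t} \nabla_{Y(t)} \operatorname{Div}\left(Y_{\text {target}}(t), Y(t)\right) $$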

    • $w_t$ is typically set to 1.0
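
A minimal sketch of the weighted-sum assumption above, using NumPy. The per-time divergence is taken to be cross-entropy against one-hot targets, which is an assumption for illustration; the function names `xent`, `sequence_divergence` and `sequence_divergence_grad` are hypothetical.

```python
import numpy as np

def xent(y_target, y):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    return -np.sum(y_target * np.log(y))

def sequence_divergence(Y_target, Y, w=None):
    """Weighted sum of per-time divergences: sum_t w_t * Div(Y_target(t), Y(t))."""
    T = Y.shape[0]
    w = np.ones(T) if w is None else np.asarray(w, dtype=float)
    return sum(w[t] * xent(Y_target[t], Y[t]) for t in range(T))

def sequence_divergence_grad(Y_target, Y, w=None):
    """Gradient w.r.t. every Y(t): w_t times the gradient of the local divergence.
    For cross-entropy, the local gradient is -Y_target(t) / Y(t)."""
    T = Y.shape[0]
    w = np.ones(T) if w is None else np.asarray(w, dtype=float)
    return w[:, None] * (-Y_target / Y)

# Two time steps, three classes (made-up numbers).
Y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
Y_target = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])

total = sequence_divergence(Y_target, Y)      # w_t defaults to 1.0
grad = sequence_divergence_grad(Y_target, Y)  # row t depends only on Y(t)
```

Row `t` of `grad` involves only `Y[t]` and `Y_target[t]`, which is exactly what the decomposition buys: the gradient at each time can be computed locally.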

#Many to many

  • The divergence computed is between the sequence of outputs by the network and the desired sequence of outputs
  • In general this is not just the sum of the divergences at individual times, unless we explicitly define it that way

#Language modelling: Representing words

  • Represent words as one-hot vectors

    • Sparsity problem: each vector is as long as the vocabulary and has a single non-zero entry
    • Makes no assumptions about the relative importance of words
  • Projected word vectors

    • Replace every one-hot vector $W_i$ by $PW_i$
    • $P$ is an $M\times N$ matrix, where $N$ is the vocabulary size and typically $M \ll N$, so $PW_i$ is an $M$-dimensional vector
  • How to learn projections

    • Soft bag of words
      • Predict word based on words in immediate context
      • Without considering specific position
    • Skip-grams
      • Predict adjacent words based on current word
  • Generating language
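
The projection of one-hot word vectors and the two ways of forming training examples described above can be sketched as follows. This is only an illustration: the sizes `N` and `M`, the random matrix `P` (which would be learned in practice), and the helper names `one_hot`, `project` and `context_pairs` are all assumptions, not part of the original notes.

```python
import numpy as np

N, M = 6, 3                      # vocabulary size and projected size (illustrative)
rng = np.random.default_rng(0)
P = rng.standard_normal((M, N))  # projection matrix; learned in practice, random here

def one_hot(i, n):
    """One-hot vector W_i of length n."""
    v = np.zeros(n)
    v[i] = 1.0
    return v

def project(i):
    """P @ W_i for word index i; this simply selects column i of P."""
    return P @ one_hot(i, N)

def context_pairs(words, window=1):
    """(context words, centre word) pairs, ignoring position within the context.
    Soft bag of words predicts the centre from the context;
    skip-grams use the same pairs the other way round."""
    pairs = []
    for t, centre in enumerate(words):
        lo, hi = max(0, t - window), min(len(words), t + window + 1)
        context = [words[k] for k in range(lo, hi) if k != t]
        pairs.append((context, centre))
    return pairs

sentence = [0, 3, 5, 2]          # a sentence as word indices
pairs = context_pairs(sentence, window=1)
```

Because `P @ one_hot(i, N)` equals `P[:, i]`, the projection can be implemented as a table lookup rather than a matrix product.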

#Many to one

  • Example
    • Question answering
      • Input : Sequence of words
      • Output: Answer at the end of the question
    • Speech recognition
      • Input : Sequence of feature vectors (e.g. Mel spectra)
      • Output: Phoneme ID at the end of the sequence
  • Outputs are actually produced for every input

    • We only read the output at the end of the sequence
  • How to train

    • Define the divergence everywhere
      • $\operatorname{Div}\left(Y_{\text {target}}, Y\right)=\sum_{t} w_{t} \operatorname{Xent}(Y(t), \text { Phoneme})$
    • Typical weighting scheme for speech
      • All time steps are equally important
    • Problems like question answering
      • The answer is only expected after the question ends, so only the final weight $w_T$ is high and the other weights are zero or low
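
A minimal sketch of the many-to-one divergence above, where a single target label is compared against the output at every time step. The function name `many_to_one_divergence` and the example numbers are hypothetical. The "last step only" weighting is one reading of the question-answering case, not a prescribed scheme.

```python
import numpy as np

def many_to_one_divergence(Y, label, w):
    """sum_t w_t * Xent(Y(t), label) for a single target class index `label`."""
    Y = np.asarray(Y, dtype=float)
    w = np.asarray(w, dtype=float)
    per_step = -np.log(Y[:, label])   # cross-entropy against a one-hot target
    return float(np.sum(w * per_step))

# Three time steps, two classes (made-up network outputs).
Y = np.array([[0.5, 0.5],
              [0.4, 0.6],
              [0.1, 0.9]])
T = Y.shape[0]

uniform = np.ones(T)        # speech-style: every time step counts equally
last_only = np.zeros(T)     # question-answering-style: only the final output counts
last_only[-1] = 1.0

div_uniform = many_to_one_divergence(Y, label=1, w=uniform)
div_last = many_to_one_divergence(Y, label=1, w=last_only)
```

With uniform weights every intermediate output is pushed toward the target, while the last-only weighting leaves the intermediate outputs unconstrained.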


#Order-synchronous, time-asynchronous

  • How do we know when to output symbols?
    • In fact, the network produces outputs at every time
    • Which of these are the real outputs?
      • Outputs that represent the definitive occurrence of a symbol
  • Option 1: Simply select the most probable symbol at each time
    • Merge adjacent repeated symbols, and place the emission of each symbol at the final instant of its run
    • Cannot distinguish between an extended symbol and repetitions of the symbol
    • Resulting sequence may be meaningless
  • Option 2: Impose external constraints on what sequences are allowed
    • Only allow sequences corresponding to dictionary words
    • Sub-symbol units
  • How to train when no timing information is provided?
    • Only the sequence of output symbols is provided for the training data
      • But no indication of which one occurs where
  • How do we compute the divergence?
    • And how do we compute its gradient?
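
Option 1 above (pick the most probable symbol at each time, then merge adjacent repeats) can be sketched as below. The function name `greedy_decode`, the symbol set and the output probabilities are made up for illustration.

```python
import numpy as np

def greedy_decode(Y, symbols):
    """Pick the most probable symbol at each time, then merge adjacent repeats.
    Each merged symbol is reported with the last time index of its run."""
    best = np.argmax(Y, axis=1)
    decoded = []
    for t, s in enumerate(best):
        if decoded and decoded[-1][0] == symbols[s]:
            # Same symbol as the previous step: extend the run, so the
            # emission moves to the run's final instant.
            decoded[-1] = (symbols[s], t)
        else:
            decoded.append((symbols[s], t))
    return decoded

symbols = ["a", "b", "c"]
Y = np.array([[0.8, 0.1, 0.1],   # a
              [0.7, 0.2, 0.1],   # a (merged with the previous step)
              [0.1, 0.8, 0.1],   # b
              [0.2, 0.2, 0.6],   # c
              [0.1, 0.1, 0.8]])  # c (merged with the previous step)
result = greedy_decode(Y, symbols)
```

The first two steps collapse into a single `a`. The decoder cannot tell whether that was one extended `a` or the symbol `a` repeated twice, which is the limitation noted under Option 1.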