Kullback-Leibler divergence

Information theory

  • Intuitions for quantifying information1
    • Likely events should have low information content
    • Less likely events should have higher information content
    • Independent events should have additive information. For example, finding out that a tossed coin has come up as heads twice should convey twice as much information as finding out that a tossed coin has come up as heads once.
  • Self-information
    • $I(x)=-\log P(x)$
    • Deals only with a single outcome
  • Shannon entropy
    • $H(\mathrm{x})=\mathbb{E}_{\mathrm{x} \sim P}[I(x)]=-\mathbb{E}_{\mathrm{x} \sim P}[\log P(x)]$
    • Quantify the amount of uncertainty in an entire probability distribution
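The two definitions above can be sketched in a few lines of NumPy (a minimal sketch; function names are my own, and values are in nats since natural log is used):

```python
import numpy as np

def self_information(p):
    """Self-information I(x) = -log P(x) of a single outcome, in nats."""
    return -np.log(p)

def entropy(probs):
    """Shannon entropy H(x) = -sum_x P(x) log P(x), in nats."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]  # convention: 0 * log 0 = 0
    return -np.sum(probs * np.log(probs))

# A certain event carries no information.
print(self_information(1.0))  # 0.0
# A fair coin is maximally uncertain among two-outcome distributions.
print(entropy([0.5, 0.5]))    # log 2, about 0.693 nats
print(entropy([0.9, 0.1]))    # lower uncertainty
```

Note how the skewed distribution has lower entropy, matching the intuition that likely events carry little information.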

KL divergence and cross-entropy

  • Measures how different two distributions over the same random variable $x$ are
    • $D_{\mathrm{KL}}(P | Q)=\mathbb{E}_{\mathrm{x} \sim P}\left[\log \frac{P(x)}{Q(x)}\right]=\mathbb{E}_{\mathrm{x} \sim P}[\log P(x)-\log Q(x)]$
  • Properties
    • Non-negative. The KL divergence is 0 if and only if $P$ and $Q$ are the same distribution
    • Not symmetric. $D_{\mathrm{KL}}(P | Q) \neq D_{\mathrm{KL}}(Q | P)$ for some $P$ and $Q$
  • Cross-entropy
    • $H(P, Q)=H(P)+D_{\mathrm{KL}}(P | Q) = -\mathbb{E}_{\mathrm{x} \sim P}[\log Q(x)]$
    • Minimizing the cross-entropy with respect to $Q$ is equivalent to minimizing the KL divergence, because the omitted term $H(P)$ does not depend on $Q$
      • In machine learning, $P$ is the real data distribution and $Q$ is the distribution computed by the model, which is why cross-entropy is used as the training objective
    • Likewise, minimizing the cross-entropy is equivalent to maximizing the log-likelihood (e.g., the Bernoulli log-likelihood in binary classification)
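The identity $H(P, Q)=H(P)+D_{\mathrm{KL}}(P | Q)$ can be checked numerically for discrete distributions (a minimal NumPy sketch; the function names and example distributions are my own):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(P), in nats."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) (log P(x) - log Q(x))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask])))

def cross_entropy(p, q):
    """H(P, Q) = -E_{x~P}[log Q(x)]."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

p = [0.6, 0.3, 0.1]
q = [0.2, 0.5, 0.3]
# H(P, Q) = H(P) + D_KL(P || Q)
assert np.isclose(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))
# Not symmetric: D_KL(P || Q) != D_KL(Q || P) in general
print(kl_divergence(p, q), kl_divergence(q, p))
```

Since $H(P)$ is a constant with respect to $Q$, the gradients of cross-entropy and KL divergence with respect to model parameters coincide.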


  • negative log-likelihood

  • Focal loss

  • Noise Contrastive Estimation (NCE)

  • Neighborhood Component Analysis

  • mutual information

  • contrastive loss

    • CVPR05 Learning a similarity metric discriminatively, with application to face verification
    • a pair of either similar or dissimilar data points
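A minimal sketch of the pairwise idea (sign conventions and the margin value are assumptions, not taken from the CVPR05 paper): similar pairs are penalized by their squared distance, dissimilar pairs only while they are closer than a margin.

```python
import numpy as np

def contrastive_loss(x1, x2, similar, margin=1.0):
    """Pairwise contrastive loss on embeddings x1, x2.

    similar=True pulls the pair together (penalize distance);
    similar=False pushes it apart until the margin is reached.
    """
    d = np.linalg.norm(np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float))
    if similar:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2

# Identical embeddings of a similar pair incur no loss.
print(contrastive_loss([0.0, 0.0], [0.0, 0.0], similar=True))   # 0.0
# A dissimilar pair already farther apart than the margin also incurs none.
print(contrastive_loss([0.0, 0.0], [2.0, 0.0], similar=False))  # 0.0
```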
  • triplet loss

    • to learn a distance in which the anchor point is closer to the similar point than to the dissimilar one.
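A hinge-style sketch of that objective (the squared-distance form and margin value are assumptions): the loss is zero once the anchor is closer to the positive than to the negative by at least the margin.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(a, p)^2 - d(a, n)^2 + margin) on embedding vectors."""
    a, p, n = (np.asarray(v, dtype=float) for v in (anchor, positive, negative))
    d_pos = np.sum((a - p) ** 2)  # squared distance to the similar point
    d_neg = np.sum((a - n) ** 2)  # squared distance to the dissimilar point
    return max(0.0, d_pos - d_neg + margin)

# Anchor already closer to the positive by more than the margin: zero loss.
print(triplet_loss([0, 0], [0.1, 0], [5, 0]))  # 0.0
```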
  • hinge loss / surrogate losses

  • Magnet loss

    • Metric Learning with adaptive density discrimination

  1. Deep Learning, chapter 3. Ian Goodfellow, Yoshua Bengio, and Aaron Courville ↩︎