Kullback–Leibler divergence
Information theory
 Quantify information in a way that formalizes intuition^{1}
 Likely events should have low information content
 Less likely events should have higher information content
 Independent events should have additive information. For example, finding out that a tossed coin has come up as heads twice should convey twice as much information as finding out that a tossed coin has come up as heads once.
 Self-information
 $I(x)=-\log P(x)$
 Deals only with a single outcome
 Shannon entropy
 $H(\mathrm{x})=\mathbb{E}_{\mathrm{x} \sim P}[I(x)]=-\mathbb{E}_{\mathrm{x} \sim P}[\log P(x)]$
 Quantify the amount of uncertainty in an entire probability distribution (see the sketch below)
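A minimal NumPy sketch of self-information and Shannon entropy (my own illustration, not from the book; natural log, so units are nats):

```python
import numpy as np

def self_information(p):
    """I(x) = -log P(x): rarer events carry more information."""
    return -np.log(p)

def entropy(P):
    """H(P) = E_{x~P}[I(x)] = -sum_x P(x) log P(x); assumes P(x) > 0."""
    P = np.asarray(P, dtype=float)
    return -np.sum(P * np.log(P))

print(entropy([0.5, 0.5]))        # fair coin: ~0.693 nats (= log 2)
print(entropy([0.9, 0.1]))        # biased coin: ~0.325, less uncertainty
print(2 * self_information(0.5))  # two independent heads: additive, ~1.386
```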
KL divergence and cross-entropy
 Measures how different two distributions over the same random variable $x$ are
 $D_{\mathrm{KL}}(P \| Q)=\mathbb{E}_{\mathrm{x} \sim P}\left[\log \frac{P(x)}{Q(x)}\right]=\mathbb{E}_{\mathrm{x} \sim P}[\log P(x)-\log Q(x)]$
 Properties
 Non-negative. The KL divergence is 0 if and only if $P$ and $Q$ are the same distribution
 Not symmetric. $D_{\mathrm{KL}}(P \| Q) \neq D_{\mathrm{KL}}(Q \| P)$ for some $P$ and $Q$ (both properties are checked in the sketch below)
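A minimal numeric check of the definition and both properties (my own sketch; assumes $P, Q > 0$ on a shared support):

```python
import numpy as np

def kl_divergence(P, Q):
    """D_KL(P || Q) = sum_x P(x) * (log P(x) - log Q(x))."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    return np.sum(P * (np.log(P) - np.log(Q)))

P, Q = [0.5, 0.5], [0.9, 0.1]
print(kl_divergence(P, P))  # 0.0: identical distributions
print(kl_divergence(P, Q))  # ~0.511: non-negative
print(kl_divergence(Q, P))  # ~0.368: differs from above, so not symmetric
```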
 Cross-entropy
 $H(P, Q)=H(P)+D_{\mathrm{KL}}(P \| Q) = -\mathbb{E}_{\mathrm{x} \sim P}[\log Q(x)]$
 Minimizing the cross-entropy with respect to $Q$ is equivalent to minimizing the KL divergence, because $Q$ does not appear in the omitted term $H(P)$
 In machine learning, $P$ represents the real data distribution and $Q$ is the distribution computed by the model, which is why cross-entropy is used as the loss
 Minimizing the cross-entropy is also equivalent to maximizing the log-likelihood, e.g. the Bernoulli log-likelihood in binary classification (see the sketch below)
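A minimal sketch tying these claims together (my own illustration; the Bernoulli label y and predicted probability q are toy values):

```python
import numpy as np

def cross_entropy(P, Q):
    """H(P, Q) = -E_{x~P}[log Q(x)]; assumes Q > 0 on P's support."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    return -np.sum(P * np.log(Q))

P, Q = np.array([0.5, 0.5]), np.array([0.9, 0.1])
H_P = -np.sum(P * np.log(P))                     # entropy of P
D = np.sum(P * (np.log(P) - np.log(Q)))          # KL divergence
print(np.isclose(cross_entropy(P, Q), H_P + D))  # True: H(P,Q) = H(P) + D_KL

# Bernoulli case: the cross-entropy loss -[y log q + (1-y) log(1-q)]
# is exactly the negative Bernoulli log-likelihood, so minimizing it
# maximizes the likelihood.
y, q = 1.0, 0.8
print(-(y * np.log(q) + (1 - y) * np.log(1 - q)))  # ~0.223
```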
TODO

negative log-likelihood

Focal loss

Neighborhood Component Analysis

mutual information

contrastive loss
 CVPR'05: Learning a Similarity Metric Discriminatively, with Application to Face Verification
 operates on a pair of data points labeled as either similar or dissimilar; see the sketch below
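A minimal sketch of one common contrastive-loss formulation (the margin value and pair-label convention are my assumptions; both vary across papers):

```python
import numpy as np

def contrastive_loss(x1, x2, similar, margin=1.0):
    """Pull similar pairs together; push dissimilar pairs at least `margin` apart."""
    d = np.linalg.norm(x1 - x2)             # Euclidean distance in embedding space
    if similar:
        return 0.5 * d ** 2                 # similar pair: penalize any distance
    return 0.5 * max(0.0, margin - d) ** 2  # dissimilar: penalize only if too close

a, b, c = np.array([0.1, 0.2]), np.array([0.15, 0.25]), np.array([0.9, 0.8])
print(contrastive_loss(a, b, similar=True))   # small: pair already close
print(contrastive_loss(a, c, similar=False))  # 0.0: pair already margin apart
```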

triplet loss
 learns a distance in which the anchor point is closer to the similar (positive) point than to the dissimilar (negative) one; see the sketch below
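A minimal sketch (the margin value is my assumption):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(a,p) - d(a,n) + margin): zero once the anchor is closer
    to the positive than to the negative by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])      # similar (positive) point
n = np.array([1.0, 0.0])      # dissimilar (negative) point
print(triplet_loss(a, p, n))  # 0.0: constraint already satisfied
```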

hinge loss / surrogate losses

Magnet loss
 ICLR'16: Metric Learning with Adaptive Density Discrimination

Deep Learning, chapter 3. Ian Goodfellow, Yoshua Bengio, and Aaron Courville ↩︎