#Kullback-Leibler divergence
#Information theory
 Quantify information in a way that formalizes the following intuition^{[1]}
 Likely events should have low information content
 Less likely events should have higher information content
 Independent events should have additive information. For example, finding out that a tossed coin has come up as heads twice should convey twice as much information as finding out that a tossed coin has come up as heads once.
 Self-information
 $I(x)=-\log P(x)$
 Deals only with a single outcome
 Shannon entropy
 $H(\mathrm{x})=\mathbb{E}_{\mathrm{x} \sim P}[I(x)]=-\mathbb{E}_{\mathrm{x} \sim P}[\log P(x)]$
 Quantify the amount of uncertainty in an entire probability distribution
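The two definitions above can be checked numerically; a minimal sketch using base-2 logs (so units are bits), with the function names and example distributions chosen here for illustration:

```python
import numpy as np

def self_information(p):
    """I(x) = -log2 P(x): information content of one outcome, in bits."""
    return -np.log2(p)

def entropy(dist):
    """Shannon entropy H = E[I(x)] of a discrete distribution, in bits."""
    dist = np.asarray(dist)
    return float(-np.sum(dist * np.log2(dist)))

# One fair-coin toss carries 1 bit; two independent heads carry 2 bits,
# matching the additivity intuition: I(heads twice) = -log2(0.25) = 2.
print(self_information(0.5))   # 1.0
print(self_information(0.25))  # 2.0

# A fair coin is maximally uncertain; a biased coin is more predictable.
print(entropy([0.5, 0.5]))     # 1.0
print(entropy([0.9, 0.1]))     # < 1.0
```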
#KL divergence and cross-entropy
 Measure how different two distributions over the same random variable $x$ are
 $D_{\mathrm{KL}}(P \| Q)=\mathbb{E}_{\mathrm{x} \sim P}\left[\log \frac{P(x)}{Q(x)}\right]=\mathbb{E}_{\mathrm{x} \sim P}[\log P(x)-\log Q(x)]$
 Properties
 Non-negative. The KL divergence is 0 if and only if $P$ and $Q$ are the same distribution
 Not symmetric. $D_{\mathrm{KL}}(P \| Q) \neq D_{\mathrm{KL}}(Q \| P)$ for some $P$ and $Q$
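Both properties are easy to see numerically; a small sketch (the two example distributions are made up for illustration):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) * (log P(x) - log Q(x)), natural log."""
    p, q = np.asarray(p), np.asarray(q)
    return float(np.sum(p * (np.log(p) - np.log(q))))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

print(kl_divergence(p, q))  # positive: non-negativity
print(kl_divergence(q, p))  # a different value: KL is not symmetric
print(kl_divergence(p, p))  # 0.0: identical distributions
```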
 Cross-entropy
 $H(P, Q)=H(P)+D_{\mathrm{KL}}(P \| Q) = -\mathbb{E}_{\mathrm{x} \sim P}[\log Q(x)]$
 Minimizing the cross-entropy with respect to $Q$ is equivalent to minimizing the KL divergence, because the omitted term $H(P)$ does not depend on $Q$
 In machine learning, $P$ is the fixed real data distribution and $Q$ is the distribution produced by the model, which is why cross-entropy is used as the training loss
 Accordingly, minimizing the cross-entropy is equivalent to maximizing the log-likelihood (e.g. the Bernoulli log-likelihood for binary labels)
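The decomposition $H(P, Q)=H(P)+D_{\mathrm{KL}}(P \| Q)$ can be verified numerically; a minimal sketch where the "data" and "model" distributions are made-up values for illustration:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p)
    return float(-np.sum(p * np.log(p)))

def kl(p, q):
    p, q = np.asarray(p), np.asarray(q)
    return float(np.sum(p * (np.log(p) - np.log(q))))

def cross_entropy(p, q):
    """H(P, Q) = -E_{x~P}[log Q(x)]."""
    p, q = np.asarray(p), np.asarray(q)
    return float(-np.sum(p * np.log(q)))

p = np.array([0.7, 0.2, 0.1])  # "data" distribution (illustrative)
q = np.array([0.5, 0.3, 0.2])  # "model" distribution (illustrative)

# H(P, Q) = H(P) + D_KL(P || Q); since H(P) is constant in Q,
# minimizing cross-entropy over Q minimizes the KL divergence.
print(cross_entropy(p, q))
print(entropy(p) + kl(p, q))  # same value up to rounding
```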
#TODO

negative log-likelihood

Focal loss

Neighborhood Component Analysis

mutual information

contrastive loss
 CVPR'05: Learning a similarity metric discriminatively, with application to face verification
 Operates on a pair of either similar or dissimilar data points

triplet loss
 to learn a distance in which the anchor point is closer to the similar point than to the dissimilar one.
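The anchor/positive/negative constraint above is commonly written as a margin-based hinge; a minimal sketch of the standard formulation (the margin value and the example points are hypothetical choices, not from these notes):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(a, p) - d(a, n) + margin) with squared Euclidean distance.

    Zero loss once the anchor is closer to the positive than to the
    negative by at least the margin.
    """
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return float(max(0.0, d_pos - d_neg + margin))

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])  # similar point, close to the anchor
n = np.array([2.0, 0.0])  # dissimilar point, far from the anchor
print(triplet_loss(a, p, n))  # 0.0: margin constraint already satisfied
```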

hinge loss / surrogate losses

Magnet loss
 Metric Learning with adaptive density discrimination