Kullback–Leibler divergence
Information theory
 Quantify information in a way that formalizes intuition^{1}
 Likely events should have low information content
 Less likely events should have higher information content
 Independent events should have additive information. For example, finding out that a tossed coin has come up as heads twice should convey twice as much information as finding out that a tossed coin has come up as heads once.
 Self-information
 $I(x)=-\log P(x)$
 Deals only with a single outcome
 Shannon entropy
 $H(\mathrm{x})=\mathbb{E}_{\mathrm{x} \sim P}[I(x)]=-\mathbb{E}_{\mathrm{x} \sim P}[\log P(x)]$
 Quantify the amount of uncertainty in an entire probability distribution (see the sketch below)
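A minimal NumPy sketch of self-information and Shannon entropy (my own illustration, not from the book; natural log, so units are nats):

```python
import numpy as np

def self_information(p):
    """I(x) = -log P(x): rarer events carry more information."""
    return -np.log(p)

def entropy(P):
    """H(P) = E_{x~P}[I(x)] = -sum_x P(x) log P(x); assumes P(x) > 0."""
    P = np.asarray(P, dtype=float)
    return -np.sum(P * np.log(P))

print(entropy([0.5, 0.5]))        # fair coin: ~0.693 nats (= log 2)
print(entropy([0.9, 0.1]))        # biased coin: ~0.325, less uncertainty
print(2 * self_information(0.5))  # two independent heads: additive, ~1.386
```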
KL divergence and cross-entropy
 Measures how different two distributions over the same random variable $x$ are
 $D_{\mathrm{KL}}(P \| Q)=\mathbb{E}_{\mathrm{x} \sim P}\left[\log \frac{P(x)}{Q(x)}\right]=\mathbb{E}_{\mathrm{x} \sim P}[\log P(x)-\log Q(x)]$
 Properties
 Non-negative. The KL divergence is 0 if and only if $P$ and $Q$ are the same distribution
 Not symmetric. $D_{\mathrm{KL}}(P \| Q) \neq D_{\mathrm{KL}}(Q \| P)$ for some $P$ and $Q$ (both properties are checked in the sketch below)
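A minimal numeric check of the definition and both properties (my own sketch; assumes $P, Q > 0$ on a shared support):

```python
import numpy as np

def kl_divergence(P, Q):
    """D_KL(P || Q) = sum_x P(x) * (log P(x) - log Q(x))."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    return np.sum(P * (np.log(P) - np.log(Q)))

P, Q = [0.5, 0.5], [0.9, 0.1]
print(kl_divergence(P, P))  # 0.0: identical distributions
print(kl_divergence(P, Q))  # ~0.511: non-negative
print(kl_divergence(Q, P))  # ~0.368: differs from above, so not symmetric
```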
 Cross-entropy
 $H(P, Q)=H(P)+D_{\mathrm{KL}}(P \| Q) = -\mathbb{E}_{\mathrm{x} \sim P}[\log Q(x)]$
 Minimizing the cross-entropy with respect to $Q$ is equivalent to minimizing the KL divergence, because $Q$ does not appear in the omitted term $H(P)$
 In machine learning, $P$ represents the real data distribution and $Q$ is the distribution computed by the model, which is why cross-entropy is used as the loss
 Minimizing the cross-entropy is also equivalent to maximizing the log-likelihood, e.g. the Bernoulli log-likelihood in binary classification (see the sketch below)
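A minimal sketch tying these claims together (my own illustration; the Bernoulli label y and predicted probability q are toy values):

```python
import numpy as np

def cross_entropy(P, Q):
    """H(P, Q) = -E_{x~P}[log Q(x)]; assumes Q > 0 on P's support."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    return -np.sum(P * np.log(Q))

P, Q = np.array([0.5, 0.5]), np.array([0.9, 0.1])
H_P = -np.sum(P * np.log(P))                     # entropy of P
D = np.sum(P * (np.log(P) - np.log(Q)))          # KL divergence
print(np.isclose(cross_entropy(P, Q), H_P + D))  # True: H(P,Q) = H(P) + D_KL

# Bernoulli case: the cross-entropy loss -[y log q + (1-y) log(1-q)]
# is exactly the negative Bernoulli log-likelihood, so minimizing it
# maximizes the likelihood.
y, q = 1.0, 0.8
print(-(y * np.log(q) + (1 - y) * np.log(1 - q)))  # ~0.223
```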
TODO

negative log-likelihood

Focal loss

Neighborhood Component Analysis

mutual information

contrastive loss
 CVPR'05: Learning a Similarity Metric Discriminatively, with Application to Face Verification
 operates on a pair of data points labeled as either similar or dissimilar; see the sketch below
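A minimal sketch of one common contrastive-loss formulation (the margin value and pair-label convention are my assumptions; both vary across papers):

```python
import numpy as np

def contrastive_loss(x1, x2, similar, margin=1.0):
    """Pull similar pairs together; push dissimilar pairs at least `margin` apart."""
    d = np.linalg.norm(x1 - x2)             # Euclidean distance in embedding space
    if similar:
        return 0.5 * d ** 2                 # similar pair: penalize any distance
    return 0.5 * max(0.0, margin - d) ** 2  # dissimilar: penalize only if too close

a, b, c = np.array([0.1, 0.2]), np.array([0.15, 0.25]), np.array([0.9, 0.8])
print(contrastive_loss(a, b, similar=True))   # small: pair already close
print(contrastive_loss(a, c, similar=False))  # 0.0: pair already margin apart
```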

triplet loss
 learns a distance in which the anchor point is closer to the similar (positive) point than to the dissimilar (negative) one; see the sketch below
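A minimal sketch (the margin value is my assumption):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(a,p) - d(a,n) + margin): zero once the anchor is closer
    to the positive than to the negative by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])      # similar (positive) point
n = np.array([1.0, 0.0])      # dissimilar (negative) point
print(triplet_loss(a, p, n))  # 0.0: constraint already satisfied
```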

hinge loss / surrogate losses

Magnet loss
 ICLR'16: Metric Learning with Adaptive Density Discrimination

Deep Learning, chapter 3. Ian Goodfellow, Yoshua Bengio, and Aaron Courville ↩︎