## Kullback-Leibler divergence#

### Information theory#

• Intuitions that a quantification of information should satisfy1
• Likely events should have low information content
• Less likely events should have higher information content
• Independent events should have additive information. For example, finding out that a tossed coin has come up as heads twice should convey twice as much information as finding out that a tossed coin has come up as heads once.
• Self-information
• $I(x)=-\log P(x)$
• Deals only with a single outcome
• Shannon entropy
• $H(\mathrm{x})=\mathbb{E}_{\mathrm{x} \sim P}[I(x)]=-\mathbb{E}_{\mathrm{x} \sim P}[\log P(x)]$
• Quantify the amount of uncertainty in an entire probability distribution
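The definitions above can be sketched numerically. A minimal example (in nats, since it uses the natural log; function names are my own):

```python
import math

def self_information(p: float) -> float:
    """Information content of a single outcome with probability p: I(x) = -log P(x)."""
    return -math.log(p)

def entropy(dist: list) -> float:
    """Shannon entropy: expected self-information under the distribution."""
    return sum(p * self_information(p) for p in dist if p > 0)

# A fair coin: each outcome carries log 2 nats of information.
print(entropy([0.5, 0.5]))       # log 2 ≈ 0.693

# Two independent fair tosses coming up heads have probability 0.25,
# so the information is additive: 2 * log 2 nats.
print(self_information(0.25))    # 2 * log 2 ≈ 1.386
```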

### KL divergence and cross-entropy#

• Measures how different two distributions over the same random variable $\mathrm{x}$ are
• $D_{\mathrm{KL}}(P | Q)=\mathbb{E}_{\mathrm{x} \sim P}\left[\log \frac{P(x)}{Q(x)}\right]=\mathbb{E}_{\mathrm{x} \sim P}[\log P(x)-\log Q(x)]$
• Properties
• Non-negative. The KL divergence is 0 if and only if $P$ and $Q$ are the same distribution
• Not symmetric. $D_{\mathrm{KL}}(P | Q) \neq D_{\mathrm{KL}}(Q | P)$ for some $P$ and $Q$
• Cross-entropy
• $H(P, Q)=H(P)+D_{\mathrm{KL}}(P | Q) = -\mathbb{E}_{\mathrm{x} \sim P}[\log Q(x)]$
• Minimizing the cross-entropy with respect to $Q$ is equivalent to minimizing the KL divergence, because the omitted term $H(P)$ does not depend on $Q$
• In machine learning, $P$ is the fixed real data distribution and $Q$ is the model distribution being fitted, which is why cross-entropy is used as the loss
• Moreover, minimizing cross-entropy is equivalent to maximizing log-likelihood (e.g., the Bernoulli log-likelihood in binary classification)
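The decomposition $H(P, Q)=H(P)+D_{\mathrm{KL}}(P | Q)$ and the asymmetry of KL can be checked directly for discrete distributions (a sketch; the example distributions are arbitrary):

```python
import math

def entropy(p):
    """H(P) = -sum_x P(x) log P(x)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q):
    """D_KL(P || Q) = sum_x P(x) * (log P(x) - log Q(x))."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -E_{x~P}[log Q(x)]."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.7, 0.2, 0.1]
Q = [0.5, 0.3, 0.2]

# Decomposition: H(P, Q) = H(P) + D_KL(P || Q)
print(cross_entropy(P, Q), entropy(P) + kl(P, Q))

# Non-negative, but not symmetric: D_KL(P || Q) != D_KL(Q || P) in general
print(kl(P, Q), kl(Q, P))
```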

## TODO#

• negative log-likelihood

• Focal loss

• Noise Contrastive Estimation (NCE)

• Neighborhood Component Analysis

• mutual information

• contrastive loss

• CVPR05 Learning a similarity metric discriminatively, with application to face verification
• a pair of either similar or dissimilar data points
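The pairwise idea can be sketched as follows. This is the common squared-hinge variant rather than necessarily the CVPR'05 paper's exact formulation, and `margin` is an illustrative hyperparameter:

```python
def contrastive_loss(d: float, similar: bool, margin: float = 1.0) -> float:
    """Contrastive loss on one pair, given the embedding distance d:
    pull similar pairs together (loss = d^2), push dissimilar pairs
    apart until they are at least `margin` away."""
    if similar:
        return d ** 2
    return max(0.0, margin - d) ** 2

print(contrastive_loss(0.2, similar=True))   # small distance, small loss
print(contrastive_loss(0.2, similar=False))  # dissimilar pair too close: penalized
print(contrastive_loss(1.5, similar=False))  # beyond the margin: zero loss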
• triplet loss

• to learn a distance in which the anchor point is closer to the similar point than to the dissimilar one.
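A minimal sketch of that objective, assuming precomputed embedding distances and an illustrative margin value:

```python
def triplet_loss(d_ap: float, d_an: float, margin: float = 0.2) -> float:
    """Triplet loss: require the anchor-positive distance d_ap to be smaller
    than the anchor-negative distance d_an by at least `margin`."""
    return max(0.0, d_ap - d_an + margin)

print(triplet_loss(0.3, 0.9))  # constraint satisfied: zero loss
print(triplet_loss(0.8, 0.6))  # anchor closer to the negative: positive loss
```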
• hinge loss / surrogate losses

• Magnet loss

• Metric Learning with adaptive density discrimination

1. Deep Learning, chapter 3. Ian Goodfellow, Yoshua Bengio and Aaron Courville ↩︎