Cross Entropy and KL Divergence

As we saw in an earlier post, the entropy of a discrete probability distribution is defined to be

$H(p)=H(p_1,p_2,\ldots,p_n)=-\sum_{i}p_i \log p_i.$

Kullback and Leibler defined a related measure now known as KL divergence. It quantifies how much a probability distribution $$p$$ differs from a candidate distribution $$q$$:

$D_{\text{KL}}(p \| q)=\sum_i p_i \log \frac{p_i}{q_i}.$

$$D_\text{KL}$$ is non-negative and zero if and only if $$p_i = q_i$$ for all $$i$$. However, it is important to note that it is not in general symmetric:

$D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p).$
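Both properties are easy to check numerically. The following is a minimal sketch, not a library implementation: the function name `kl_divergence` and the two example distributions are made up for illustration, the natural logarithm is used (so the result is in nats), and $$q_i > 0$$ is assumed wherever $$p_i > 0$$.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) in nats for two discrete distributions given as sequences.

    Assumes q_i > 0 wherever p_i > 0; otherwise the divergence is infinite
    and this sketch raises ZeroDivisionError.
    """
    # Terms with p_i == 0 contribute nothing, by the convention 0 * log 0 = 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical example distributions over three outcomes.
p = [0.5, 0.4, 0.1]
q = [0.2, 0.3, 0.5]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
kl_pp = kl_divergence(p, p)

print("D_KL(p || q) =", kl_pq)
print("D_KL(q || p) =", kl_qp)
print("D_KL(p || p) =", kl_pp)
```

The two directions give different positive values, while comparing $$p$$ with itself gives zero.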

Jonathon Shlens explains that KL divergence can be interpreted as measuring how likely it is that samples represented by the empirical distribution $$p$$ were generated by a fixed distribution $$q$$. If $$D_{\text{KL}}(p \| q)=0$$, then $$p$$ and $$q$$ are identical, so the samples look exactly like what $$q$$ would be expected to produce. As $$D_{\text{KL}}(p \| q)\rightarrow\infty$$, it becomes increasingly unlikely that $$p$$ was generated by $$q$$.
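One way to make this reading concrete is to compare log-likelihoods. For $$n$$ samples with empirical distribution $$p$$, the log-likelihood under $$q$$ falls short of the log-likelihood under $$p$$ itself by exactly $$n$$ times the KL divergence. The sketch below checks this; the counts and the candidate distribution are hypothetical values chosen for illustration, not data from Shlens's text.

```python
import math

# Hypothetical observed counts for three outcomes, and a candidate distribution.
counts = [50, 30, 20]
q = [0.4, 0.4, 0.2]

n = sum(counts)
p_hat = [c / n for c in counts]  # empirical distribution of the sample

# Log-likelihood of the whole sample under q, and under p_hat itself
# (p_hat is the distribution that fits the sample best).
loglik_q = sum(c * math.log(qi) for c, qi in zip(counts, q))
loglik_p = sum(c * math.log(pi) for c, pi in zip(counts, p_hat))

# KL divergence from the empirical distribution to the candidate.
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p_hat, q))

# Per-sample log-likelihood shortfall of q relative to p_hat.
per_sample_gap = (loglik_p - loglik_q) / n

print("D_KL(p_hat || q)       =", kl)
print("per-sample loglik gap  =", per_sample_gap)
```

The two printed numbers agree up to floating-point error: the larger the divergence, the faster the likelihood of the sample under $$q$$ decays with $$n$$.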

Algebraically, we can rewrite the definition as

$\begin{array}{rl} D_{\text{KL}}(p \| q) &=\sum_i p_i \log \frac{p_i}{q_i} \\ &=\sum_i \left ( - p_i \log q_i + p_i \log p_i \right)\\ &=- \sum_i p_i \log q_i + \sum_i p_i \log p_i \\ &=- \sum_i p_i \log q_i - \sum_i p_i \log \frac{1}{p_i} \\ &=- \sum_i p_i \log q_i-H(p) \\ &=\sum_i p_i \log \frac{1}{q_i}-H(p) \end{array}$

So KL divergence decomposes into a term that resembles entropy but combines $$p$$ and $$q$$, minus the entropy of $$p$$. The first term is called the cross entropy:

$H(p, q)=\sum_i p_i \log \frac{1}{q_i}.$

We could alternatively use this relationship to define cross entropy as:

$H(p, q)=H(p) + D_\text{KL}(p \| q).$

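Intuitively, the cross entropy is the uncertainty inherent in $$p$$, measured by $$H(p)$$, plus a penalty $$D_\text{KL}(p \| q)$$ that grows as it becomes less likely that $$p$$ could have been generated by $$q$$. If we consider $$p$$ to be a fixed distribution, $$H(p, q)$$ and $$D_\text{KL}(p \| q)$$ differ only by the additive constant $$H(p)$$ for all $$q$$, so minimizing one over $$q$$ also minimizes the other.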
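The relationship between cross entropy, entropy, and KL divergence can be checked numerically for a fixed $$p$$ and several candidates $$q$$. This is a rough sketch under stated assumptions: the function names and the example distributions are invented for illustration, logarithms are natural, and every candidate is assumed to be strictly positive.

```python
import math


def entropy(p):
    """H(p) = -sum_i p_i log p_i, in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """H(p, q) = sum_i p_i log(1 / q_i), in nats."""
    return sum(pi * math.log(1 / qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i log(p_i / q_i), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical fixed distribution and candidate distributions.
p = [0.5, 0.4, 0.1]
candidates = [
    [0.2, 0.3, 0.5],
    [0.4, 0.4, 0.2],
    [0.5, 0.4, 0.1],  # identical to p
]

# For every candidate, H(p, q) - D_KL(p || q) should equal H(p).
gaps = [cross_entropy(p, q) - kl_divergence(p, q) for q in candidates]

# Because the gap does not depend on q, both criteria pick the same candidate.
best_by_ce = min(candidates, key=lambda q: cross_entropy(p, q))
best_by_kl = min(candidates, key=lambda q: kl_divergence(p, q))

print("H(p)  =", entropy(p))
print("gaps  =", gaps)
print("same minimizer:", best_by_ce == best_by_kl)
```

Every entry of `gaps` matches $$H(p)$$ up to floating-point error, and both criteria select the candidate equal to $$p$$.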