Cross Entropy and KL Divergence

As we saw in an earlier post, the entropy of a discrete probability distribution is defined to be

\[H(p)=H(p_1,p_2,\ldots,p_n)=-\sum_{i}p_i \log p_i.\]
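To make the definition concrete, here is a minimal sketch in Python (the NumPy dependency, the natural-log base, and the example distribution are arbitrary choices of mine, not anything prescribed by the definition):

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_i p_i log p_i, using the natural log (entropy in nats)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]          # terms with p_i = 0 contribute nothing, by convention
    return -np.sum(p * np.log(p))

p = [0.5, 0.25, 0.25]
print(entropy(p))         # ≈ 1.0397 nats
```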

Kullback and Leibler defined a closely related measure, now known as the KL divergence, which quantifies how much a probability distribution \(p\) diverges from a candidate distribution \(q\):

\[D_{\text{KL}}(p \| q)=\sum_i p_i \log \frac{p_i}{q_i}.\]

\(D_\text{KL}\) is non-negative and zero if and only if \( p_i = q_i \) for all \(i\). However, it is important to note that it is not in general symmetric:

\[ D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p).\]
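A small numerical example (again just a sketch, with distributions made up for illustration) shows both properties, the non-negativity and the asymmetry:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i log(p_i / q_i), using the natural log."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0          # terms with p_i = 0 contribute nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.5, 0.25, 0.25])
q = np.array([0.8, 0.1, 0.1])

print(kl_divergence(p, q))   # ≈ 0.223
print(kl_divergence(q, p))   # ≈ 0.193  -- not the same, so D_KL is not symmetric
print(kl_divergence(p, p))   # 0.0      -- zero exactly when the distributions match
```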

Jonathon Shlens explains that KL divergence can be interpreted as measuring the likelihood that samples represented by the empirical distribution \(p\) were generated by a fixed distribution \(q\). If \(D_{\text{KL}}(p \| q)=0\), the two distributions are identical and the samples are perfectly consistent with \(q\). As \(D_{\text{KL}}(p \| q)\rightarrow\infty\), it becomes increasingly unlikely that \(p\) was generated by \(q\).

Algebraically, we can rewrite the definition as

\[ \begin{array}{rl} D_{\text{KL}}(p \| q) &=\sum_i p_i \log \frac{p_i}{q_i} \\ &=\sum_i \left( - p_i \log q_i + p_i \log p_i \right)\\ &=- \sum_i p_i \log q_i + \sum_i p_i \log p_i \\ &=- \sum_i p_i \log q_i - \sum_i p_i \log \frac{1}{p_i} \\ &=- \sum_i p_i \log q_i - H(p) \\ &=\sum_i p_i \log \frac{1}{q_i} - H(p).\\ \end{array} \]

The KL divergence thus breaks down into something that looks like an entropy, but mixing \(p\) and \(q\), minus the entropy of \(p\). The first term is often called the cross entropy:

\[H(p, q)=\sum_i p_i \log \frac{1}{q_i}.\]

We could alternatively use this relationship to define cross entropy as:

\[H(p, q)=H(p) + D_\text{KL}(p \| q).\]

Intuitively, the cross entropy is the uncertainty implicit in \(H(p)\) plus the penalty for how unlikely it is that \(p\) could have been generated by \(q\). If we consider \(p\) to be a fixed distribution, \(H(p, q)\) and \(D_\text{KL}(p \| q)\) differ by a constant, namely \(H(p)\), for all \(q\).
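As a final check, the decomposition \(H(p, q) = H(p) + D_\text{KL}(p \| q)\) is easy to verify numerically. The sketch below (same assumptions as the earlier snippets: NumPy, natural logs, made-up distributions) shows that for a fixed \(p\), the cross entropy and the KL divergence differ by the constant \(H(p)\) no matter which \(q\) we try:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log q_i, using the natural log."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.5, 0.25, 0.25])
for q in ([0.8, 0.1, 0.1], [1/3, 1/3, 1/3], [0.5, 0.25, 0.25]):
    q = np.array(q)
    # H(p, q) - D_KL(p || q) equals H(p) ≈ 1.0397 for every choice of q
    print(cross_entropy(p, q), kl_divergence(p, q),
          cross_entropy(p, q) - kl_divergence(p, q))
```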