As we saw in an earlier post, the entropy of a discrete probability distribution is defined to be

$$H(p)=H(p_1,p_2,\ldots,p_n)=-\sum_{i}p_i \log p_i.$$
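
As a quick sanity check, the definition can be computed directly. Here is a minimal sketch in Python; the function name `entropy` and the choice of base-2 logarithms (giving bits) are my own conventions, not from the original definition:

```python
import math

def entropy(p, base=2):
    """H(p) = -sum_i p_i log p_i, with the convention 0 log 0 = 0."""
    return -sum(p_i * math.log(p_i, base) for p_i in p if p_i > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit, maximal uncertainty
print(entropy([0.9, 0.1]))   # biased coin: less than 1 bit
```

A deterministic outcome, such as `entropy([1.0])`, gives zero: there is no uncertainty to measure.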

Kullback and Leibler defined a similar measure now known as *KL divergence*. This measure quantifies how much a probability distribution $p$ differs from a candidate distribution $q$:

$$D_{\text{KL}}(p \| q)=\sum_i p_i \log \frac{p_i}{q_i}.$$

$D_{\text{KL}}$ is non-negative, and zero if and only if $p_i = q_i$ for all $i$. Note, however, that it is not in general symmetric:

$$D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p).$$
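
The asymmetry is easy to see numerically. A small sketch (the function name `kl_divergence` is mine, and $q_i > 0$ is assumed wherever $p_i > 0$):

```python
import math

def kl_divergence(p, q, base=2):
    """D_KL(p || q) = sum_i p_i log(p_i / q_i); requires q_i > 0 where p_i > 0."""
    return sum(p_i * math.log(p_i / q_i, base) for p_i, q_i in zip(p, q) if p_i > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))  # cost of modeling a fair coin with a biased one
print(kl_divergence(q, p))  # a different number: the divergence is not symmetric
```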

Jonathon Shlens explains that KL divergence can be interpreted as measuring the likelihood that samples represented by the empirical distribution $p$ were generated by a fixed distribution $q$. If $D_{\text{KL}}(p \| q)=0$, the two distributions are identical, so the samples are entirely consistent with $q$. As $D_{\text{KL}}(p \| q)\rightarrow\infty$, it becomes increasingly unlikely that the samples were generated by $q$.

Algebraically, we can rewrite the definition as

$$ \begin{aligned} D_{\text{KL}}(p \| q) &= \sum_i p_i \log \frac{p_i}{q_i} \\ &= \sum_i \left( - p_i \log q_i + p_i \log p_i \right) \\ &= -\sum_i p_i \log q_i + \sum_i p_i \log p_i \\ &= -\sum_i p_i \log q_i - \sum_i p_i \log \frac{1}{p_i} \\ &= -\sum_i p_i \log q_i - H(p) \\ &= \sum_i p_i \log \frac{1}{q_i} - H(p). \end{aligned} $$

KL divergence thus decomposes into a term that looks like an entropy, but mixing $p$ and $q$, minus the entropy of $p$. The first term is called *cross entropy*:

$$H(p, q)=\sum_i p_i \log \frac{1}{q_i}.$$
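
The decomposition $D_{\text{KL}}(p \| q) = H(p, q) - H(p)$ can be verified numerically. A sketch under the same conventions as the formulas above (the function names and the base-2 choice are mine):

```python
import math

def entropy(p):
    """H(p) = -sum_i p_i log2 p_i."""
    return -sum(p_i * math.log(p_i, 2) for p_i in p if p_i > 0)

def cross_entropy(p, q):
    """H(p, q) = sum_i p_i log2(1 / q_i)."""
    return -sum(p_i * math.log(q_i, 2) for p_i, q_i in zip(p, q) if p_i > 0)

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i log2(p_i / q_i)."""
    return sum(p_i * math.log(p_i / q_i, 2) for p_i, q_i in zip(p, q) if p_i > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
# The derivation says these two quantities are equal:
print(kl_divergence(p, q))
print(cross_entropy(p, q) - entropy(p))
```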

We could alternatively use this relationship to define cross entropy as:

$$H(p, q)=H(p) + D_{\text{KL}}(p \| q).$$

Intuitively, the cross entropy is the uncertainty already inherent in $p$, measured by $H(p)$, plus the extra penalty $D_{\text{KL}}(p \| q)$ for assuming the samples were generated by $q$ rather than $p$. If we consider $p$ to be a fixed distribution, $H(p, q)$ and $D_{\text{KL}}(p \| q)$ differ by the additive constant $H(p)$ for all $q$, so minimizing one over $q$ is equivalent to minimizing the other.
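
One consequence worth checking: because the gap between $H(p, q)$ and $D_{\text{KL}}(p \| q)$ is always $H(p)$, sweeping over candidate distributions $q$ should leave that gap unchanged. A sketch reusing the same hand-rolled helpers as before (names are mine):

```python
import math

def entropy(p):
    return -sum(p_i * math.log(p_i, 2) for p_i in p if p_i > 0)

def cross_entropy(p, q):
    return -sum(p_i * math.log(q_i, 2) for p_i, q_i in zip(p, q) if p_i > 0)

def kl_divergence(p, q):
    return sum(p_i * math.log(p_i / q_i, 2) for p_i, q_i in zip(p, q) if p_i > 0)

p = [0.6, 0.3, 0.1]
for q in ([0.5, 0.3, 0.2], [0.2, 0.5, 0.3], [1/3, 1/3, 1/3]):
    gap = cross_entropy(p, q) - kl_divergence(p, q)
    print(gap)  # the same value, H(p), regardless of q
```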