As we saw in an earlier post, the entropy of a discrete probability distribution is defined to be
$$H(p)=H(p\_1,p\_2,\ldots,p\_n)=-\sum\_{i}p\_i \log p\_i.$$Kullback and Leibler defined a similar measure now known as KL divergence. This measure quantifies how similar a probability distribution $p$ is to a candidate distribution $q$.
$$D_{\text{KL}}(p\| q)=\sum_i p_i \log \frac{p_i}{q_i}.$$$D_\text{KL}$ is non-negative and zero if and only if $ p_i = q_i $ for all $i$. However, it is important to note that it is not in general symmetric:
$$ D_{\text{KL}}(p\| q) \neq D_{\text{KL}}(q\| p).$$Jonathon Shlens explains that KL divergence can be interpreted as measuring the likelihood that samples represented by the empirical distribution $p$ were generated by a fixed distribution $q$. If $D_{\text{KL}}(p\| q)=0$, the two distributions are identical, so the samples are perfectly consistent with having been generated by $q$. As $D_{\text{KL}}(p\| q)\rightarrow\infty$, it becomes increasingly unlikely that $p$ was generated by $q$.
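A minimal sketch makes both the definition and the asymmetry concrete. The function name and the example distributions below are illustrative choices, not from the original post:

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as sequences of
    probabilities. Terms with p_i = 0 contribute nothing, since
    p_i * log(p_i / q_i) -> 0 as p_i -> 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

print(kl_divergence(p, q))  # small but nonzero: p and q are close
print(kl_divergence(q, p))  # a different value: KL is not symmetric
print(kl_divergence(p, p))  # exactly 0: identical distributions
```

Note that the two directions give different numbers for the same pair of distributions, which is why KL divergence is not a true distance metric.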
Algebraically, we can rewrite the definition as
$$ \begin{array}{rl} D_{\text{KL}}(p\| q) &=\sum_i p_i \log \frac{p_i}{q_i} \\\\ &=\sum_i \left ( - p_i \log q_i + p_i \log p_i \right)\\\\ &=- \sum_i p_i \log q_i + \sum_i p_i \log p_i \\\\ &=- \sum_i p_i \log q_i - \sum_i p_i \log \frac{1}{p_i} \\\\ &=- \sum_i p_i \log q_i-H(p) \\\\ &=\sum_i p_i \log \frac{1}{q_i}-H(p)\\\\ \end{array} $$KL divergence thus decomposes into a term that looks like entropy (but mixes $p$ and $q$) minus the entropy of $p$. The first term is often called cross entropy:
$$H(p, q)=\sum_i p_i \log \frac{1}{q_i}.$$We could alternatively use this relationship to define cross entropy as:
$$H(p, q)=H(p) + D_\text{KL}(p\| q).$$Intuitively, the cross entropy is the uncertainty implicit in $H(p)$ plus the penalty for assuming that $p$ could have been generated by $q$. If we consider $p$ to be a fixed distribution, $H(p, q)$ and $D_\text{KL}(p \| q)$ differ by an additive constant for all $q$.
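The decomposition above can be checked numerically. A small sketch, with illustrative function names and example distributions of my choosing:

```python
import math

def entropy(p):
    """H(p) = -sum_i p_i log p_i, skipping zero-probability terms."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log q_i."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

# Verify H(p, q) = H(p) + D_KL(p || q), up to floating-point error.
lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
assert math.isclose(lhs, rhs)
```

Minimizing cross entropy in $q$ is therefore the same as minimizing $D_\text{KL}(p\| q)$, since the $H(p)$ term does not depend on $q$.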