As we saw in an earlier post, the entropy of a discrete probability distribution is defined to be
$$H(p)=H(p_1,p_2,\ldots,p_n)=-\sum_{i}p_i \log p_i.$$
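As a quick numerical refresher, here is a minimal sketch of that definition. The `entropy` helper name and the use of natural logarithms (so the result is in nats) are my own choices; the post does not fix a base.

```python
import numpy as np

def entropy(p):
    # H(p) = -sum_i p_i * log(p_i), using the natural log (nats).
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

print(entropy([0.25, 0.25, 0.25, 0.25]))  # ≈ 1.386 (= log 4, the uniform case)
print(entropy([0.9, 0.05, 0.03, 0.02]))   # ≈ 0.428, far less uncertain
```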
Kullback and Leibler defined a similar measure, now known as KL divergence. This measure quantifies how much a probability distribution $p$ diverges from a candidate distribution $q$:
$$D_{\text{KL}}(p \| q)=\sum_i p_i \log \frac{p_i}{q_i}.$$
$D_\text{KL}$ is non-negative and zero if and only if $ p_i = q_i $ for all $i$. However, it is important to note that it is not in general symmetric:
$$ D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p).$$
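To make the asymmetry concrete, here is a small sketch. The `kl_divergence` helper is illustrative and assumes $q_i > 0$ wherever $p_i > 0$, so the ratio inside the log is well defined.

```python
import numpy as np

def kl_divergence(p, q):
    # D_KL(p || q) = sum_i p_i * log(p_i / q_i); assumes q_i > 0 wherever p_i > 0.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = [0.9, 0.1]   # a heavily skewed distribution
q = [0.5, 0.5]   # a uniform candidate

print(kl_divergence(p, q))  # ≈ 0.368
print(kl_divergence(q, p))  # ≈ 0.511 -- the two directions disagree
print(kl_divergence(p, p))  # 0.0     -- zero exactly when the distributions match
```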
Jonathon Shlens explains that KL divergence can be interpreted as measuring the likelihood that samples represented by the empirical distribution $p$ were generated by a fixed distribution $q$. If $D_{\text{KL}}(p \| q)=0$, the samples are perfectly consistent with having been drawn from $q$ (the two distributions are identical). As $D_{\text{KL}}(p \| q)\rightarrow\infty$, it becomes increasingly unlikely that the samples represented by $p$ were generated by $q$.
Algebraically, we can rewrite the definition as
$$ \begin{array}{rl} D_{\text{KL}}(p \| q) &=\sum_i p_i \log \frac{p_i}{q_i} \\ &=\sum_i \left ( - p_i \log q_i + p_i \log p_i \right)\\ &=- \sum_i p_i \log q_i + \sum_i p_i \log p_i \\ &=- \sum_i p_i \log q_i - \sum_i p_i \log \frac{1}{p_i} \\ &=- \sum_i p_i \log q_i-H(p) \\ &=\sum_i p_i \log \frac{1}{q_i}-H(p)\\ \end{array} $$
KL divergence thus breaks down into a term that looks like entropy (but mixes $p$ and $q$) minus the entropy of $p$. The first term is often called cross entropy:
$$H(p, q)=\sum_i p_i \log \frac{1}{q_i}.$$
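As a sketch of that first term on its own (again with natural logs; `cross_entropy` is just an illustrative name):

```python
import numpy as np

def cross_entropy(p, q):
    # H(p, q) = sum_i p_i * log(1 / q_i) = -sum_i p_i * log(q_i)
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

p = [0.9, 0.1]
print(cross_entropy(p, [0.5, 0.5]))  # ≈ 0.693 (= log 2, since the candidate is uniform)
print(cross_entropy(p, p))           # ≈ 0.325 -- coding p with itself just gives H(p)
```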
We could alternatively use this relationship to define cross entropy as:
$$H(p, q)=H(p) + D_\text{KL}(p \| q).$$
Intuitively, the cross entropy is the uncertainty implicit in $H(p)$ plus the extra penalty incurred by assuming that samples from $p$ were instead generated by $q$. If we consider $p$ to be a fixed distribution, $H(p, q)$ and $D_\text{KL}(p \| q)$ differ by a constant additive term, $H(p)$, for all $q$.
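A short self-contained check of this decomposition; the specific $p$ and candidate $q$'s below are arbitrary choices for illustration.

```python
import numpy as np

p = np.array([0.9, 0.1])                 # fixed "true" distribution
for q in ([0.5, 0.5], [0.8, 0.2], [0.99, 0.01]):
    q = np.array(q)
    h_p   = -np.sum(p * np.log(p))       # H(p)
    kl    = np.sum(p * np.log(p / q))    # D_KL(p || q)
    cross = -np.sum(p * np.log(q))       # H(p, q)
    print(f"q={q}: H(p,q)={cross:.4f}  H(p)+D_KL={h_p + kl:.4f}")
# The two printed values agree for every q, and H(p, q) exceeds D_KL(p || q)
# by the same constant H(p) ≈ 0.325 in each case.
```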