## Entropy of a Discrete Probability Distribution

### Explaining how Claude Shannon quantified the uncertainty of discrete probability distributions

Suppose we have a discrete set of possible events $$1,\ldots, n$$ that occur with probabilities $$(p_1, p_2, \ldots, p_n)$$. Claude Shannon asked the question

Can we find a measure of how much “choice” is involved in the selection of the event or of how uncertain we are of the outcome?

For example, suppose we have a coin that lands on heads with probability $$p$$ and tails with probability $$1-p$$. If $$p=1$$, the coin always lands on heads. Since there is no uncertainty, we might want to say the uncertainty is 0. However, if the coin is fair and $$p=0.5$$, we maximize our uncertainty: it’s a complete tossup whether the coin is heads or tails. We might want to say the uncertainty in this case is 1.

In general, Shannon wanted to devise a function $$H(p_1, p_2, \ldots, p_n)$$ describing the uncertainty of an arbitrary set of discrete events (i.e. a $$n$$-sided die). He thought that “it is reasonable” that $$H$$ should have three properties:

1. $$H$$ should be a continuous function of each $$p_i$$. A small change in a single probability should result in a similarly small entropy (uncertainty) change.

2. If each event is equally likely ($$p_i=\frac{1}{n}$$), $$H$$ should increase as a function of $$n$$: the more events there are, the more uncertain we are.

3. Finally, entropy should be additive for independent events. Suppose we generate a random variable $$x$$ by the following process: Flip a fair coin. If it is heads, $$x=0$$. However, if the flip was tails, flip the coin again (an independent event from the first flip). If the second flip is heads, $$x=1$$, if tails $$x=2$$. These three outcomes occur with probability $1/2$, $1/4$, and $1/4$, respectively.

We can compute the entropy of $x$ as $$H(p_0=1/2, p_1=1/4, p_2=1/4)$$. By the independence property, this relationship holds:

$H\left(\frac{1}{2}, \frac{1}{4}, \frac{1}{4}\right)=H\left(\frac{1}{2}, \frac{1}{2}\right) + \frac{1}{2} H\left(\frac{1}{2}, \frac{1}{2}\right).$

As David MacKay explains, this is the general claim that

$H(\mathbf{p})=H(p_1, 1-p_1)+(1-p_1)H\left(\frac{p_2}{1-p_1}, \frac{p_3}{1-p_1}, \ldots, \frac{p_n}{1-p_1}\right).$

Shannon showed that given these three assumptions, there is a unique form that $$H$$ must take:

$$H\propto -\sum_{i=1}^n p_i \log p_i=\sum_{i=1}^n p_i \log \frac{1}{p_i}.$$

He named this measure of uncertainty entropy, because the form of $$H$$ bears striking similarity to that of Gibbs Entropy in statistical thermodynamics.1

Shannon observes that $$H$$ has many other interesting properties:

1. Entropy $$H$$ is 0 if and only if exactly one event has probability 1 and the rest have probability 0. (Uncertainty vanishes only when we are certain about the outcomes.)
2. Entropy $$H$$ is maximized when the $$p_i$$ values are equal.
3. The joint entropy of two events is less than or equal to the sum of the individual entropies. $$H(x, y)=H(x)+H(y)$$ only when $$x$$ and $$y$$ are independent events.