  1. What’s there?
    1. Why do I write this?
  2. Entropy
    1. Brain dump
    2. Theoretical development
  3. Cross-Entropy
    1. Theory
    2. What does this mean?
    3. Relationship with KL Divergence
  4. Kullback-Leibler (KL) Divergence
    1. Theory
    2. With cross entropy
    3. Minimizing forward vs. reverse KL
  5. Mutual Information
    1. Understanding
    2. Theory
    3. In terms of KL Divergence
  6. Sources

What’s there?

Basic principles of information theory – entropy, cross-entropy, KL divergence, and mutual information.

Why do I write this?

I wanted to create an authoritative document I could refer back to later. Thought I might as well make it public (build in public, serve society, and that sort of stuff).

Entropy

Brain dump

The Concept of Information (or “Surprisal”): The foundational idea is that learning the outcome of an event is more “informative” if that event was unlikely.

Axioms for an Information Measure: Shannon proposed that a measure of information, I(p), associated with an event of probability p, should have the following properties:

  1. Information is non-negative: I(p) \ge 0.
  2. If an event is certain (p=1), its information content is zero: I(1) = 0.
  3. If an event is less likely, it is more informative: if p_1 < p_2, then I(p_1) > I(p_2).
  4. The information gained from observing two independent events is the sum of their individual information contents: I(p_1 p_2) = I(p_1) + I(p_2).

The Information Formula: The unique function satisfying these axioms (up to the choice of logarithm base) is the negative logarithm:

I(p) = -\log_b(p)

The base of the logarithm, b, determines the units of information. If b=2, the unit is bits. If b=e (natural log), the unit is nats. In machine learning, we typically use the natural logarithm. An event with probability 1/2 has -\log_2(1/2) = 1 bit of information.
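
For concreteness, here is a tiny sketch in plain Python (the helper name `information` is just made up for illustration):

```python
import math

def information(p: float, base: float = 2.0) -> float:
    """Information content (surprisal) of an event with probability p."""
    return -math.log(p, base)

print(information(0.5))            # 1.0 bit: a fair coin flip
print(information(0.5, math.e))    # ~0.693 nats: the same event in natural-log units
print(information(0.01))           # ~6.64 bits: rare events are highly informative
print(information(1.0))            # -0.0, i.e. zero: a certain event carries no information
```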

Theoretical development

  • Formal Definition (Entropy): Let X be a discrete random variable that can take on K possible states {x_1, \dots, x_K} with probabilities P(X=x_k) = p_k. The Shannon entropy of the random variable X, denoted H(X) or H(P), is:

H(X) = \mathbb{E}[I(P(X))] = \sum_{k=1}^K p_k I(p_k) = -\sum_{k=1}^K p_k \log p_k

By convention, 0 \log 0 = 0.

  • Properties of Entropy:
    • Minimum Entropy: The entropy is minimal (H(X)=0) when there is no uncertainty. This occurs when one outcome is certain (p_k=1 for some k) and all others are impossible (p_j=0 for j \neq k).
    • Maximum Entropy: The entropy is maximal when there is the most uncertainty. For a variable with K states, this occurs when all outcomes are equally likely, i.e., the distribution is uniform (p_k = 1/K for all k). In this case, H(X) = \log K.
    • Geometric Interpretation: Entropy measures the “flatness” or “peakedness” of a probability distribution. A peaked distribution has low entropy; a flat distribution has high entropy.
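
A quick numerical sanity check of these properties, assuming a small hand-rolled `entropy` helper and made-up distributions over K = 4 states:

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy; terms with p = 0 are skipped (0 log 0 = 0 by convention)."""
    return sum(-p * math.log(p, base) for p in probs if p > 0)

peaked  = [1.0, 0.0, 0.0, 0.0]          # one outcome is certain
skewed  = [0.7, 0.1, 0.1, 0.1]
uniform = [0.25, 0.25, 0.25, 0.25]      # maximum uncertainty for K = 4

print(entropy(peaked))    # 0.0 bits   -> minimum entropy, no uncertainty
print(entropy(skewed))    # ~1.357 bits
print(entropy(uniform))   # 2.0 bits   -> log2(4), the maximum for K = 4
```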

Cross-Entropy

Theory

  • Formal Definition (Cross-Entropy): Let P be the true probability distribution over a set of events, and Q be our model’s predicted probability distribution over the same set of events. The cross-entropy of Q with respect to P is:

H(P, Q) = \mathbb{E}_{X \sim P}[I(Q(X))] = \sum_{k=1}^K p_k I(q_k) = -\sum_{k=1}^K p_k \log q_k

Notice the structure: we are averaging the information content of the model’s probabilities (-\log q_k) over the true probabilities (p_k).

What does this mean?

Think about encoding the letters of the English alphabet into Morse code.

  • True Distribution (P): In real English text, the letter ‘E’ is extremely common, while ‘Z’ is very rare. An efficient code would give ‘E’ a very short representation (like a single dot) and ‘Z’ a much longer one. This is the optimal encoding scheme based on the true frequency of letters.
  • Model Distribution (Q): Now, suppose you mistakenly believe that ‘Z’ is the most common letter and ‘E’ is the rarest. You design a new Morse code based on this incorrect assumption. You’d give ‘Z’ a short code and ‘E’ a very long, inefficient one.

Cross-entropy is the average length of the messages you’d have to send using your inefficient, ‘Z’-focused code to communicate actual English text. Because your code is optimized for the wrong reality, your messages will be, on average, much longer than necessary.

The extra length is the “penalty” for your incorrect assumption.
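
To put rough numbers on the story, here is a toy two-letter version (the frequencies are invented for illustration, and code lengths are idealised as -\log_2 of the assumed probability):

```python
import math

p = {"E": 0.9, "Z": 0.1}   # true letter frequencies (illustrative)
q = {"E": 0.1, "Z": 0.9}   # mistaken frequencies the code was designed for

# Idealised code length for a letter: -log2 of the probability the code assumes.
optimal_len = sum(p[c] * -math.log2(p[c]) for c in p)   # H(P)   ~ 0.469 bits/letter
wrong_len   = sum(p[c] * -math.log2(q[c]) for c in p)   # H(P,Q) ~ 3.005 bits/letter

print(optimal_len, wrong_len)   # the gap (~2.54 bits/letter) is the penalty
```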

Deconstructing the formula:

The cross-entropy of Q relative to P is calculated as:

H(P, Q) = -\sum_{x} P(x) \log Q(x)

Let’s break this down:

  • P(x): The true probability of an event x. This acts as a weight. We care more about events that actually happen often.
  • \log Q(x): This relates to the “cost” of encoding event x using our model’s assumed probability. If our model says an event is very likely (Q(x) is high), the cost is low. If it says an event is very unlikely (Q(x) is low), the cost is very high.
  • -\sum_{x}: We sum this “weighted cost” over all possible events to get the average cost, or the cross-entropy.
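
A direct transcription of the formula as a sketch (the three-event distributions are made up), showing how a common event that the model badly underestimates dominates the sum:

```python
import math

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log2 Q(x); events with P(x) = 0 contribute nothing."""
    return sum(-px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

p      = [0.70, 0.20, 0.10]   # true distribution over three events
q_good = [0.65, 0.25, 0.10]   # close to P: cheap on average
q_bad  = [0.01, 0.29, 0.70]   # assigns ~1% to the event that happens 70% of the time

print(cross_entropy(p, p))       # ~1.157 bits, the entropy H(P) itself
print(cross_entropy(p, q_good))  # ~1.167 bits, only slightly worse
print(cross_entropy(p, q_bad))   # ~5.06 bits, dominated by 0.7 * -log2(0.01)
```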

Relationship with KL Divergence

Cross-entropy is intimately related to Shannon entropy and the Kullback-Leibler (KL) Divergence, which measures the “extra” bits/nats needed due to using the wrong code.

\begin{aligned} H(P, Q) &= -\sum_k p_k \log q_k \\ &= -\sum_k p_k \log \left( \frac{q_k}{p_k} p_k \right) \\ &= -\sum_k p_k \log p_k - \sum_k p_k \log \frac{q_k}{p_k} \\ &= H(P) + D_{KL}(P || Q) \end{aligned}

Cross-Entropy = Entropy + KL Divergence.

This identity is super important, because:

In machine learning, the true data distribution P is fixed (but unknown). We are trying to find a model Q that is as close to P as possible. Since H(P) is a constant with respect to our model, minimizing the cross-entropy H(P, Q) is equivalent to minimizing the KL divergence D_{KL}(P || Q). Therefore, cross-entropy serves as an excellent loss function.
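
We can also check the decomposition numerically; the snippet below is a minimal sketch with made-up distributions, not a full verification:

```python
import math

def entropy(p):
    return sum(-pk * math.log2(pk) for pk in p if pk > 0)

def cross_entropy(p, q):
    return sum(-pk * math.log2(qk) for pk, qk in zip(p, q) if pk > 0)

def kl_divergence(p, q):
    return sum(pk * math.log2(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

# H(P, Q) = H(P) + D_KL(P || Q) holds term by term:
print(cross_entropy(p, q))               # ~1.522 bits
print(entropy(p) + kl_divergence(p, q))  # ~1.522 bits, the same value
```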

Kullback-Leibler (KL) Divergence

Theory

Let P(x) and Q(x) be two probability distributions over the same random variable x. The KL Divergence of Q from P, denoted D_{KL}(P || Q), is defined as the expectation of the logarithmic difference between the two distributions, where the expectation is taken with respect to the true distribution P.

  • For discrete distributions:

D_{KL}(P || Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}

  • For continuous distributions:

D_{KL}(P || Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} dx
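
For the discrete case, a minimal sketch (invented distributions) that also previews something the next subsection relies on: KL divergence is not symmetric in its arguments:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats, for discrete distributions given as aligned lists."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.8, 0.1, 0.1]
q = [0.4, 0.3, 0.3]

print(kl_divergence(p, q))   # ~0.335 nats
print(kl_divergence(q, p))   # ~0.382 nats: the divergence is not symmetric
print(kl_divergence(p, p))   # 0.0: a distribution has zero divergence from itself
```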

With cross entropy

As shown earlier, D_{KL}(P || Q) = H(P, Q) - H(P).

The KL divergence (the penalty) is the difference between the average message length using the wrong code (H(P,Q)) and the average message length using the optimal code (H(P)).

Minimizing forward vs. reverse KL

  • Minimizing D_{KL}(P || Q): This is the “forward KL” used in Maximum Likelihood Estimation. It forces Q to be non-zero wherever P is non-zero. This leads to “mode-covering” or “zero-avoiding” behavior, where the model Q tries to be broad enough to capture all the probability mass of the true distribution P.
  • Minimizing D_{KL}(Q || P): This is the “reverse KL” used in Variational Inference. It forces Q to be zero wherever P is zero. This leads to “mode-seeking” or “zero-forcing” behavior, where the approximation Q tends to find and fit one of the modes of a multi-modal true distribution P.

Um, what?

Mode Covering

This version asks, “What’s the penalty for my model Q failing to account for something that actually happens under P?”

  • The Penalty: Look at the term P(x) \log\left(\frac{P(x)}{Q(x)}\right). If there is a place where the true probability is high (P(x) > 0) but your model assigns it a near-zero probability (Q(x) \to 0), the ratio \frac{P(x)}{Q(x)} becomes huge. The logarithm of this huge number is also huge, resulting in an infinite KL divergence.
  • Behavior (“Zero-Avoiding”): To avoid this infinite penalty, the model is forced to place some probability mass everywhere the true distribution has mass. It learns to avoid assigning zero probability where it shouldn’t.
  • Result (“Mode-Covering”): If the true distribution P has multiple peaks (is multi-modal), the model Q must stretch itself out to cover all of them to avoid the infinite penalty. This results in a broad, “averaged” approximation that covers all modes, even if it doesn’t perfectly capture any single one. This is why it’s used in Maximum Likelihood Estimation, where the goal is to find a single set of parameters that best explains all the observed data.

Mode Seeking

This version flips the question and asks, “What’s the penalty for my model Q hallucinating something that is impossible under P?”

  • The Penalty: The formula for reverse KL is D_{KL}(Q || P) = \sum_x Q(x) \log\left(\frac{Q(x)}{P(x)}\right). Now, consider a place where your model assigns some probability (Q(x) > 0), but the true probability is zero (P(x) = 0). The ratio \frac{Q(x)}{P(x)} becomes infinite, again leading to an infinite KL divergence.
  • Behavior (“Zero-Forcing”): To avoid this infinite penalty, the model is forced to have zero probability wherever the true distribution has zero probability. It learns to only place probability mass in regions where it’s sure the true distribution exists.
  • Result (“Mode-Seeking”): If the true distribution P has multiple peaks separated by regions of zero probability, the model Q cannot stretch to cover them all because that would force it to place probability in the zero-regions, resulting in an infinite penalty. Instead, the safest strategy is to pick one single mode of P and fit it as well as possible. This is why it’s used in Variational Inference, where we often want a simpler, tractable approximation (Q) that captures the most significant part of a complex true posterior distribution (P).
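
A small discrete sketch of both behaviours (all numbers invented): a bimodal P with a “hole” in the middle, one broad Q that covers both modes, and one narrow Q that commits to a single mode:

```python
import math

def kl(p, q):
    """D_KL(P || Q); returns inf if Q(x) = 0 anywhere that P(x) > 0."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue            # 0 * log(0 / q) = 0 by convention
        if qx == 0:
            return math.inf     # p > 0 where q = 0 -> infinite penalty
        total += px * math.log(px / qx)
    return total

# A bimodal "true" distribution P over 5 states, with zero mass in the middle.
p       = [0.45, 0.05, 0.00, 0.05, 0.45]
q_cover = [0.25, 0.15, 0.20, 0.15, 0.25]   # broad: covers both modes, spills into the hole
q_seek  = [0.80, 0.20, 0.00, 0.00, 0.00]   # narrow: commits to the left mode only

print(kl(p, q_cover), kl(p, q_seek))   # ~0.419 vs inf : forward KL prefers the covering Q
print(kl(q_cover, p), kl(q_seek, p))   # inf vs ~0.738 : reverse KL prefers the mode-seeking Q
```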

Mutual Information

Understanding

How much does knowing the value of variable Y reduce my uncertainty about variable X?

Recall that entropy, H(X), measures our total uncertainty about X. After we observe the value of Y, our uncertainty about X is now given by the conditional entropy, H(X|Y). Mutual Information is simply the reduction in uncertainty.

\text{Mutual Information} = (\text{Initial Uncertainty about X}) - (\text{Remaining Uncertainty about X after seeing Y})

Theory

The Mutual Information between two random variables X and Y, denoted I(X; Y), is defined as:

I(X; Y) = H(X) - H(X|Y)

where H(X) is the entropy of X and H(X|Y) = -\sum_{x,y} p(x,y)\log p(x|y) is the conditional entropy.
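
A minimal sketch with a made-up 2×2 joint distribution, computing I(X; Y) directly from this definition:

```python
import math

# Hypothetical joint distribution p(x, y) over two binary variables: rows are x, columns are y.
joint = [[0.4, 0.1],
         [0.1, 0.4]]

p_x = [sum(row) for row in joint]         # marginal p(x) = [0.5, 0.5]
p_y = [sum(col) for col in zip(*joint)]   # marginal p(y) = [0.5, 0.5]

# H(X): initial uncertainty about X
H_x = sum(-px * math.log2(px) for px in p_x)

# H(X|Y): remaining uncertainty, using p(x|y) = p(x, y) / p(y)
H_x_given_y = -sum(joint[x][y] * math.log2(joint[x][y] / p_y[y])
                   for x in range(2) for y in range(2))

print(H_x, H_x_given_y, H_x - H_x_given_y)   # 1.0, ~0.722, ~0.278 bits
```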

In terms of KL Divergence

I(X; Y) = D_{KL}( p(x, y) || p(x)p(y) )

  • p(x, y) is the true joint distribution of the two variables, capturing all dependencies between them.
  • p(x)p(y) is the hypothetical joint distribution that we would have if the two variables were perfectly independent.
  • Therefore, the Mutual Information is the KL divergence from the independent case to the true joint case. It measures how much the true joint distribution deviates from the assumption of independence. If the variables are truly independent, then p(x,y)=p(x)p(y), and the KL divergence is zero.
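
Running the same made-up joint distribution through the KL form gives the same number:

```python
import math

joint = [[0.4, 0.1],
         [0.1, 0.4]]
p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]

# I(X; Y) = D_KL( p(x, y) || p(x) p(y) )
mi = sum(joint[x][y] * math.log2(joint[x][y] / (p_x[x] * p_y[y]))
         for x in range(2) for y in range(2) if joint[x][y] > 0)

print(mi)   # ~0.278 bits, matching H(X) - H(X|Y) from the previous snippet
```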

Sources

  1. Mathematics for Machine Learning (link)
