What’s there?
Basic principles of information theory – entropy, cross-entropy, KL divergence, and mutual information.
Why am I writing this?
I wanted to create an authoritative document I could refer back to later. I figured I might as well make it public (build in public, serve society, and that sort of stuff).
Entropy
Brain dump
The Concept of Information (or “Surprisal”): The foundational idea is that learning the outcome of an event is more “informative” if that event was unlikely.
Axioms for an Information Measure: Shannon proposed that a measure of information, $I(p)$, associated with an event of probability $p$, should have the following properties:
- Information is non-negative: $I(p) \ge 0$.
- If an event is certain ($p = 1$), its information content is zero: $I(1) = 0$.
- If an event is less likely, it is more informative: if $p_1 < p_2$, then $I(p_1) > I(p_2)$.
- The information gained from observing two independent events is the sum of their individual information contents: $I(p_1 p_2) = I(p_1) + I(p_2)$.
The Information Formula: The unique function that satisfies these axioms is the negative logarithm:

$$I(p) = -\log_b(p)$$

The base of the logarithm, $b$, determines the units of information. If $b = 2$, the unit is bits. If $b = e$ (natural log), the unit is nats. In machine learning, we typically use the natural logarithm. An event with probability $\tfrac{1}{2}$ has $1$ bit of information.
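Here is a tiny Python sketch of the formula (the function name and the example probabilities are my own choices):

```python
import math

def information_content(p: float, base: float = 2.0) -> float:
    """Surprisal I(p) = -log_b(p), computed as log_b(1/p)."""
    return math.log(1.0 / p, base)

print(information_content(0.5))   # 1.0: an event with probability 1/2 carries 1 bit
print(information_content(1.0))   # 0.0: a certain event carries no information
print(information_content(0.01))  # ~6.64: rarer events are more informative
```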
Theoretical development
- Formal Definition (Entropy): Let $X$ be a discrete random variable that can take on $K$ possible states $x_1, \dots, x_K$ with probabilities $p_1, \dots, p_K$. The Shannon entropy of the random variable $X$, denoted $H(X)$ or $H(p)$, is:

$$H(X) = -\sum_{k=1}^{K} p_k \log p_k$$

By convention, $0 \log 0 = 0$.
- Properties of Entropy:
- Minimum Entropy: The entropy is minimal ($H(X) = 0$) when there is no uncertainty. This occurs when one outcome is certain ($p_k = 1$ for some $k$) and all others are impossible ($p_j = 0$ for $j \ne k$).
- Maximum Entropy: The entropy is maximal when there is the most uncertainty. For a variable with $K$ states, this occurs when all outcomes are equally likely, i.e., the distribution is uniform ($p_k = \tfrac{1}{K}$ for all $k$). In this case, $H(X) = \log K$.
- Geometric Interpretation: Entropy measures the “flatness” or “peakedness” of a probability distribution. A peaked distribution has low entropy; a flat distribution has high entropy. The sketch below checks the two extreme cases numerically.
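Here is a quick numeric check of the definition and the two extreme cases (a minimal sketch; the function name and example distributions are mine):

```python
import math

def entropy(probs, base: float = math.e) -> float:
    """Shannon entropy H(X) = -sum_k p_k log p_k, with the convention 0 log 0 = 0."""
    # p * log(1/p) is the same as -p * log(p); terms with p = 0 are skipped.
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

print(entropy([1.0, 0.0, 0.0, 0.0], base=2))      # 0.0: one certain outcome, no uncertainty
print(entropy([0.25, 0.25, 0.25, 0.25], base=2))  # 2.0 = log2(4): uniform over K = 4 states
```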
Cross-Entropy
Theory
- Formal Definition (Cross-Entropy): Let $p$ be the true probability distribution over a set of events, and $q$ be our model’s predicted probability distribution over the same set of events. The cross-entropy of $q$ with respect to $p$ is:

$$H(p, q) = -\sum_{k} p_k \log q_k$$

Notice the structure: we are averaging the information content of the model’s probabilities ($-\log q_k$) over the true probabilities ($p_k$).
What does this mean?
Think about encoding the letters of the English alphabet into Morse code.
- True Distribution ($p$): In real English text, the letter ‘E’ is extremely common, while ‘Z’ is very rare. An efficient code would give ‘E’ a very short representation (like a single dot) and ‘Z’ a much longer one. This is the optimal encoding scheme based on the true frequency of letters.
- Model Distribution ($q$): Now, suppose you mistakenly believe that ‘Z’ is the most common letter and ‘E’ is the rarest. You design a new Morse code based on this incorrect assumption. You’d give ‘Z’ a short code and ‘E’ a very long, inefficient one.
Cross-entropy is the average length of the messages you’d have to send using your inefficient, ‘Z’-focused code to communicate actual English text. Because your code is optimized for the wrong reality, your messages will be, on average, much longer than necessary.
The extra length is the “penalty” for your incorrect assumption.
Deconstructing the formula – $H(p, q)$
The cross-entropy of $q$ relative to $p$ is calculated as:

$$H(p, q) = -\sum_{k} p_k \log q_k$$

Let’s break this down:
- $p_k$: The true probability of an event $k$. This acts as a weight. We care more about events that actually happen often.
- $-\log q_k$: This relates to the “cost” of encoding event $k$ using our model’s assumed probability. If our model says an event is very likely ($q_k$ is high), the cost is low. If it says an event is very unlikely ($q_k$ is low), the cost is very high.
- $\sum_k$: We sum this “weighted cost” over all possible events to get the average cost, or the cross-entropy.
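Here is a small Python sketch of the formula (the example distributions are toy values of my own, in the spirit of the Morse-code story above):

```python
import math

def cross_entropy(p, q) -> float:
    """H(p, q) = -sum_k p_k log q_k: the average cost (in nats) of encoding
    events drawn from p with a code optimized for q."""
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.7, 0.2, 0.1]       # true frequencies
q_good = [0.7, 0.2, 0.1]  # model that matches reality
q_bad = [0.1, 0.2, 0.7]   # model optimized for the wrong reality

print(cross_entropy(p, q_good))  # ~0.80 nats: equals H(p), the optimal average cost
print(cross_entropy(p, q_bad))   # ~1.97 nats: the penalty for using the wrong code
```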
Relationship with KL Divergence
Cross-entropy is intimately related to Shannon entropy and the Kullback-Leibler (KL) Divergence, which measures the “extra” bits/nats needed due to using the wrong code.
Cross-Entropy = Entropy + KL Divergence:

$$H(p, q) = H(p) + D_{KL}(p \,\|\, q)$$

This is super important: in machine learning, the true data distribution $p$ is fixed (but unknown). We are trying to find a model $q$ that is as close to $p$ as possible. Since $H(p)$ is a constant with respect to our model, minimizing the cross-entropy $H(p, q)$ is equivalent to minimizing the KL divergence $D_{KL}(p \,\|\, q)$. Therefore, cross-entropy serves as an excellent loss function.
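As a concrete illustration of the loss-function use (my own example, assuming a classification setting where the true distribution is one-hot): cross-entropy then collapses to the negative log-probability the model assigns to the correct class.

```python
import math

def cross_entropy_loss(true_class: int, predicted_probs) -> float:
    """With a one-hot true distribution p, H(p, q) reduces to -log q[true_class]."""
    return -math.log(predicted_probs[true_class])

# A model that puts more mass on the correct class pays a smaller penalty.
print(cross_entropy_loss(0, [0.9, 0.05, 0.05]))  # ~0.11
print(cross_entropy_loss(0, [0.3, 0.4, 0.3]))    # ~1.20
```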
Kullback-Leibler (KL) Divergence
Theory
Let $p$ and $q$ be two probability distributions over the same random variable $X$. The KL Divergence of $q$ from $p$, denoted $D_{KL}(p \,\|\, q)$, is defined as the expectation of the logarithmic difference between the two distributions, where the expectation is taken with respect to the true distribution $p$.
- For discrete distributions:
$$D_{KL}(p \,\|\, q) = \sum_{k} p_k \log \frac{p_k}{q_k}$$
- For continuous distributions:
$$D_{KL}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$$
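A minimal sketch of the discrete case in Python (toy distributions of my own):

```python
import math

def kl_divergence(p, q) -> float:
    """D_KL(p || q) = sum_k p_k log(p_k / q_k), in nats (discrete case)."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.7, 0.2, 0.1]
q = [0.2, 0.3, 0.5]

print(kl_divergence(p, p))  # 0.0: no penalty when the model matches the truth
print(kl_divergence(p, q))  # ~0.63: the extra nats paid for using the wrong code
print(kl_divergence(q, p))  # ~0.68: a different number, since KL is not symmetric
```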
With cross-entropy
As shown earlier, $H(p, q) = H(p) + D_{KL}(p \,\|\, q)$.
The KL divergence (the penalty) is the difference between the average message length using the wrong code ($H(p, q)$) and the average message length using the optimal code ($H(p)$).
Minimizing forward vs. reverse KL
- Minimizing $D_{KL}(p \,\|\, q)$: This is the “forward KL” used in Maximum Likelihood Estimation. It forces $q$ to be non-zero wherever $p$ is non-zero. This leads to “mode-covering” or “zero-avoiding” behavior, where the model $q$ tries to be broad enough to capture all the probability mass of the true distribution $p$.
- Minimizing $D_{KL}(q \,\|\, p)$: This is the “reverse KL” used in Variational Inference. It forces $q$ to be zero wherever $p$ is zero. This leads to “mode-seeking” or “zero-forcing” behavior, where the approximation $q$ tends to find and fit one of the modes of a multi-modal true distribution $p$.
Um, what?
Mode Covering
This version asks, “What’s the penalty for my model failing to account for something that is actually true ($p(x) > 0$)?”
- The Penalty: Look at the term $p(x) \log \frac{p(x)}{q(x)}$. If there is a place where the true probability is high ($p(x) > 0$) but your model assigns it a near-zero probability ($q(x) \approx 0$), the ratio $\frac{p(x)}{q(x)}$ becomes huge. The logarithm of this huge number is also huge, and as $q(x) \to 0$ the KL divergence blows up to infinity.
- Behavior (“Zero-Avoiding”): To avoid this infinite penalty, the model is forced to place some probability mass everywhere the true distribution has mass. It learns to avoid assigning zero probability where it shouldn’t.
- Result (“Mode-Covering”): If the true distribution $p$ has multiple peaks (is multi-modal), the model $q$ must stretch itself out to cover all of them to avoid the infinite penalty. This results in a broad, “averaged” approximation that covers all modes, even if it doesn’t perfectly capture any single one. This is why it’s used in Maximum Likelihood Estimation, where the goal is to find a single set of parameters that best explains all the observed data.
Mode Seeking
This version flips the question and asks, “What’s the penalty for my model hallucinating something that is actually impossible ($p(x) = 0$)?”
- The Penalty: The formula for reverse KL is $D_{KL}(q \,\|\, p) = \sum_x q(x) \log \frac{q(x)}{p(x)}$. Now, consider a place where your model assigns some probability ($q(x) > 0$), but the true probability is zero ($p(x) = 0$). The ratio $\frac{q(x)}{p(x)}$ becomes infinite, again leading to an infinite KL divergence.
- Behavior (“Zero-Forcing”): To avoid this infinite penalty, the model is forced to have zero probability wherever the true distribution has zero probability. It learns to only place probability mass in regions where it’s sure the true distribution exists.
- Result (“Mode-Seeking”): If the true distribution $p$ has multiple peaks separated by regions of zero probability, the model $q$ cannot stretch to cover them all because that would force it to place probability in the zero-regions, resulting in an infinite penalty. Instead, the safest strategy is to pick one single mode of $p$ and fit it as well as possible. This is why it’s used in Variational Inference, where we often want a simpler, tractable approximation ($q$) that captures the most significant part of a complex true posterior distribution ($p$).
Mutual Information
Understanding
How much does knowing the value of variable $Y$ reduce my uncertainty about variable $X$?
Recall that entropy, $H(X)$, measures our total uncertainty about $X$. After we observe the value of $Y$, our uncertainty about $X$ is now given by the conditional entropy, $H(X \mid Y)$. Mutual Information is simply the reduction in uncertainty.
Theory
The Mutual Information between two random variables $X$ and $Y$, denoted $I(X; Y)$, is defined as:

$$I(X; Y) = H(X) - H(X \mid Y)$$

where $H(X)$ is the entropy of $X$ and $H(X \mid Y)$ is the conditional entropy.
In terms of KL Divergence
$$I(X; Y) = D_{KL}\big(p(x, y) \,\|\, p(x)\,p(y)\big)$$

- $p(x, y)$ is the true joint distribution of the two variables, capturing all dependencies between them.
- $p(x)\,p(y)$ is the hypothetical joint distribution that we would have if the two variables were perfectly independent.
- Therefore, the Mutual Information is the KL divergence from the independent case to the true joint case. It measures how much the true joint distribution deviates from the assumption of independence. If the variables are truly independent, then $p(x, y) = p(x)\,p(y)$, and the KL divergence is zero.
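A small numeric sketch tying the two views together (the joint table is a toy example of my own):

```python
import numpy as np

# Toy joint distribution p(x, y) over two binary variables; rows index x, columns index y.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])

px = joint.sum(axis=1)  # marginal p(x)
py = joint.sum(axis=0)  # marginal p(y)

def H(probs):
    """Shannon entropy of a 1-D probability vector, in nats."""
    probs = probs[probs > 0]
    return -np.sum(probs * np.log(probs))

# View 1: I(X;Y) = H(X) - H(X|Y), with H(X|Y) = sum_y p(y) * H(X | Y = y).
H_X_given_Y = sum(py[j] * H(joint[:, j] / py[j]) for j in range(len(py)))
mi_reduction = H(px) - H_X_given_Y

# View 2: I(X;Y) = D_KL( p(x, y) || p(x) p(y) ).
mi_kl = np.sum(joint * np.log(joint / np.outer(px, py)))

print(mi_reduction, mi_kl)              # ~0.19 nats from both formulas
print(np.isclose(mi_reduction, mi_kl))  # True
```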
Sources
- Mathematics for Machine Learning (link)