- What’s there?
- Introduction
- Stability under linear transformation
- Now, more introductions
- Marginals and Conditionals
- Product of Gaussian Densities
- Sums and Linear Transformations of Gaussians
- Sources
What’s there?
Started with some Probability and Statistics; doing Gaussians now. This is an info dump.
Why do I write this?
I wanted to create an authoritative document I could refer back to later. Thought that I might as well make it public (build in public, serve society and that sort of stuff).
Introduction
Definition of a Univariate Gaussian
A random variable $X$ is said to follow a Gaussian distribution, written $X \sim \mathcal{N}(\mu, \sigma^2)$, if its PDF is given by the specific functional form:

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
Properties
- Symmetry: The PDF is perfectly symmetric about its mean $\mu$. This implies that the mean, median, and mode of the distribution are all equal.
- The Normalization Constant: The term $\frac{1}{\sqrt{2\pi\sigma^2}}$ is the normalization constant. It ensures that the total area under the curve integrates to 1, as required for any valid PDF: $\int_{-\infty}^{\infty} p(x)\,dx = 1$.
- The Standard Normal Distribution: A special case of the Gaussian with $\mu = 0$ and $\sigma^2 = 1$ is called the standard normal distribution, denoted $\mathcal{N}(0, 1)$. Any Gaussian variable $X \sim \mathcal{N}(\mu, \sigma^2)$ can be transformed into a standard normal variable $Z$ via standardization: $Z = \frac{X - \mu}{\sigma}$.
68-95-99.7 Empirical Rule
I’m just adding this here because it comes up everywhere.
- One Standard Deviation ($\mu \pm 1\sigma$): Approximately 68.27% of the area under the curve lies within one standard deviation of the mean.
- Two Standard Deviations ($\mu \pm 2\sigma$): Approximately 95.45% of the area under the curve lies within two standard deviations of the mean.
- Three Standard Deviations ($\mu \pm 3\sigma$): Approximately 99.73% of the area under the curve lies within three standard deviations of the mean.
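These percentages can be sanity-checked numerically. A minimal sketch using numpy sampling (the seed, sample size, and the particular $\mu$ and $\sigma$ are arbitrary choices; the exact values come from the Gaussian CDF):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 5.0, 2.0
samples = rng.normal(mu, sigma, size=1_000_000)

# Fraction of samples within k standard deviations of the mean
for k in (1, 2, 3):
    frac = np.mean(np.abs(samples - mu) < k * sigma)
    print(f"within {k} sigma: {frac:.4f}")
```

With a million samples the printed fractions land within a fraction of a percent of 0.6827, 0.9545, and 0.9973.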
Stability under linear transformation
The Gaussian distribution is stable under affine transformations. If $X \sim \mathcal{N}(\mu, \sigma^2)$ and $Y = aX + b$ for scalars $a$ and $b$, then $Y$ is also a Gaussian random variable. Its mean and variance are:

$$\mathbb{E}[Y] = a\mu + b, \qquad \mathrm{Var}(Y) = a^2\sigma^2$$

Therefore, $Y \sim \mathcal{N}(a\mu + b,\; a^2\sigma^2)$.
Also, the sum of two independent Gaussian random variables is a Gaussian random variable.
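A quick sampling check of this stability property (the specific values of $\mu$, $\sigma^2$, $a$, $b$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2 = 2.0, 9.0          # X ~ N(2, 9)
a, b = 3.0, -1.0               # Y = aX + b

x = rng.normal(mu, np.sqrt(sigma2), size=1_000_000)
y = a * x + b

# Closed form says Y ~ N(a*mu + b, a^2 * sigma^2) = N(5, 81)
print(y.mean(), a * mu + b)
print(y.var(), a**2 * sigma2)
```

The empirical mean and variance of `y` agree with the closed-form $a\mu + b = 5$ and $a^2\sigma^2 = 81$ to sampling error.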
Now, more introductions
The Multivariate Gaussian Distribution
A random vector $\mathbf{x} \in \mathbb{R}^D$ is said to have a multivariate Gaussian distribution if its Probability Density Function (PDF) is given by:

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$

We use the notation $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$.
Let’s deconstruct this formula –
- Normalization Constant: The term $\frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}}$ normalizes the density. Here $|\boldsymbol{\Sigma}|$ is the determinant of the covariance matrix; it equals the product of the eigenvalues of $\boldsymbol{\Sigma}$ and geometrically measures the volume occupied by the data cloud.
- The Exponential Argument: The term inside the exponent,

$$\Delta^2 = (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})$$

This quadratic form is known as the Mahalanobis distance squared. It measures the distance from a point $\mathbf{x}$ to the mean $\boldsymbol{\mu}$, taking into account the covariance of the data. It is a unitless distance that accounts for the fact that the data cloud may be stretched and rotated.
The Geometry of the MVN
The loci of points with constant probability density (the isocontours) are the sets of $\mathbf{x}$ for which the Mahalanobis distance is constant:

$$(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) = c$$

This is the equation of a hyperellipse centered at $\boldsymbol{\mu}$. The shape and orientation of this hyperellipse are completely determined by the covariance matrix $\boldsymbol{\Sigma}$.
Now, to understand the geometry of the ellipse, do an eigendecomposition –

$$\boldsymbol{\Sigma} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^\top$$

where $\mathbf{U}$ is an orthogonal matrix whose columns $\mathbf{u}_i$ are the eigenvectors of $\boldsymbol{\Sigma}$, and $\boldsymbol{\Lambda}$ is a diagonal matrix of the corresponding non-negative eigenvalues $\lambda_i$.
- Principal Axes: The eigenvectors $\mathbf{u}_i$ of $\boldsymbol{\Sigma}$ define the directions of the principal axes of the hyperellipse.
- Axis Lengths: The eigenvalues $\lambda_i$ determine the spread along these axes. The length of the semi-axis along the direction $\mathbf{u}_i$ is proportional to the square root of the corresponding eigenvalue, $\sqrt{\lambda_i}$.

This means that the MVN is a cloud of points whose main axes of variation are aligned with the eigenvectors of its covariance matrix.
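To make this concrete, here is a small sketch that eigendecomposes an example covariance matrix (the matrix itself is made up for illustration):

```python
import numpy as np

# A 2x2 covariance matrix: correlated and anisotropic
Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])

# Sigma = U Lambda U^T; columns of U are the principal axes of the isocontour ellipse
eigvals, U = np.linalg.eigh(Sigma)

print("eigenvalues:", eigvals)
print("semi-axis lengths proportional to:", np.sqrt(eigvals))
print("principal axes (columns):\n", U)

# Sanity check: the decomposition reconstructs Sigma exactly
assert np.allclose(U @ np.diag(eigvals) @ U.T, Sigma)
```

`np.linalg.eigh` is used (rather than `eig`) because $\boldsymbol{\Sigma}$ is symmetric, which guarantees real eigenvalues and orthonormal eigenvectors.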
Partitioning the vector
We begin with a joint Gaussian distribution over a high-dimensional random vector $\mathbf{x} \in \mathbb{R}^D$. Let’s partition this vector into two disjoint subsets, $\mathbf{x}_a$ and $\mathbf{x}_b$.

The joint distribution is defined by a partitioned mean vector and a partitioned covariance matrix:

$$\boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b \end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{aa} & \boldsymbol{\Sigma}_{ab} \\ \boldsymbol{\Sigma}_{ba} & \boldsymbol{\Sigma}_{bb} \end{pmatrix}$$

where $\boldsymbol{\mu}_a, \boldsymbol{\mu}_b$ are the mean vectors, $\boldsymbol{\Sigma}_{aa}, \boldsymbol{\Sigma}_{bb}$ are the covariance matrices of $\mathbf{x}_a$ and $\mathbf{x}_b$ respectively, and $\boldsymbol{\Sigma}_{ab} = \boldsymbol{\Sigma}_{ba}^\top$ is the cross-covariance matrix.
We can now ask two questions:

- Marginalization: What is the distribution of one subset, $\mathbf{x}_a$, if we have no information about the other?
- Conditioning: What is the distribution of one subset, $\mathbf{x}_a$, if we observe the value of the other?
Marginals and Conditionals
Marginals
To find the marginal distribution $p(\mathbf{x}_a)$, we must integrate out (or “marginalize”) the other variable, $\mathbf{x}_b$, from the joint distribution:

$$p(\mathbf{x}_a) = \int p(\mathbf{x}_a, \mathbf{x}_b)\, d\mathbf{x}_b$$

The marginal distribution of a joint Gaussian is also a Gaussian.

The result of this integration is remarkably simple. We effectively just “read off” the corresponding parts from the joint distribution’s parameters:

$$\mathbf{x}_a \sim \mathcal{N}(\boldsymbol{\mu}_a, \boldsymbol{\Sigma}_{aa})$$
Intuition: If you have a 2D Gaussian “mountain,” its shadow projected onto the x-axis (integrating out y) is a 1D Gaussian bell curve.
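A small numerical illustration of reading off the marginal (the joint parameters here are made up):

```python
import numpy as np

rng = np.random.default_rng(3)

# Joint 2-D Gaussian over (x, y)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.5]])

samples = rng.multivariate_normal(mu, Sigma, size=500_000)

# Marginal of x: just read off mu[0] and Sigma[0, 0] -- no integration needed
x = samples[:, 0]
print(x.mean())   # close to mu[0] = 1.0
print(x.var())    # close to Sigma[0, 0] = 2.0
```

Ignoring the `y` column of the samples is exactly the "shadow on the x-axis" picture: the empirical mean and variance of `x` match $\mu_a$ and $\Sigma_{aa}$.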
Conditionals
Remember that $p(\mathbf{x}_a \mid \mathbf{x}_b) = \dfrac{p(\mathbf{x}_a, \mathbf{x}_b)}{p(\mathbf{x}_b)}$.
The conditional distribution of a joint Gaussian is also a Gaussian.
The derivation involves algebraic manipulation of the exponents of the Gaussian PDFs (completing the square).
Without going too much into the maths, I’ll offer the following results –
- Conditional Mean: $\boldsymbol{\mu}_{a|b} = \boldsymbol{\mu}_a + \boldsymbol{\Sigma}_{ab} \boldsymbol{\Sigma}_{bb}^{-1} (\mathbf{x}_b - \boldsymbol{\mu}_b)$
- Conditional Covariance: $\boldsymbol{\Sigma}_{a|b} = \boldsymbol{\Sigma}_{aa} - \boldsymbol{\Sigma}_{ab} \boldsymbol{\Sigma}_{bb}^{-1} \boldsymbol{\Sigma}_{ba}$
Interpretation:
- The conditional mean is a linear function of the observed variable $\mathbf{x}_b$. It starts at the prior mean $\boldsymbol{\mu}_a$ and is adjusted based on how surprising the observation $\mathbf{x}_b$ is (i.e., its deviation from its own mean, $\mathbf{x}_b - \boldsymbol{\mu}_b$). The adjustment is scaled by the correlation term $\boldsymbol{\Sigma}_{ab} \boldsymbol{\Sigma}_{bb}^{-1}$.
- The conditional covariance is independent of the observation $\mathbf{x}_b$. Observing data reduces our uncertainty. The new covariance is the prior covariance $\boldsymbol{\Sigma}_{aa}$ minus a positive semi-definite term, $\boldsymbol{\Sigma}_{ab} \boldsymbol{\Sigma}_{bb}^{-1} \boldsymbol{\Sigma}_{ba}$, that represents the information gained from the observation.
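These two formulas translate directly into numpy. A sketch, where `condition` is a hypothetical helper and the joint parameters are made up for illustration:

```python
import numpy as np

# Joint Gaussian over (x_a, x_b), each 1-D here for simplicity
mu = np.array([0.0, 1.0])                 # [mu_a, mu_b]
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])            # [[S_aa, S_ab], [S_ba, S_bb]]

def condition(mu, Sigma, ia, ib, xb):
    """Parameters of p(x_a | x_b = xb) for a partitioned joint Gaussian."""
    S_aa = Sigma[np.ix_(ia, ia)]
    S_ab = Sigma[np.ix_(ia, ib)]
    S_bb = Sigma[np.ix_(ib, ib)]
    K = S_ab @ np.linalg.inv(S_bb)        # gain that scales the "surprise"
    mu_cond = mu[ia] + K @ (xb - mu[ib])
    Sigma_cond = S_aa - K @ S_ab.T        # S_aa - S_ab S_bb^{-1} S_ba
    return mu_cond, Sigma_cond

mu_c, Sigma_c = condition(mu, Sigma, ia=[0], ib=[1], xb=np.array([2.0]))
print(mu_c)      # 0 + (0.8/1.0)*(2.0 - 1.0) = [0.8]
print(Sigma_c)   # 2.0 - 0.8*0.8/1.0 = [[1.36]]
```

Note that the observation `xb = 2.0` (one standard deviation above its own mean) pulls the mean of $x_a$ upward, and the conditional variance 1.36 is smaller than the prior variance 2.0.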
Product of Gaussian Densities
When applying Bayes’ theorem, we often need to multiply a Gaussian likelihood by a Gaussian prior. This product is also related to the Gaussian family.
Bayesian Inference Context
In a Bayesian setting, the posterior is proportional to the likelihood times the prior:

$$p(\boldsymbol{\theta} \mid \mathcal{D}) \propto p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})$$
Yeah, “posterior” is just a fancy term for the updated belief.
What this is
The product of two Gaussian PDFs is an unnormalized Gaussian PDF.
Let’s consider two Gaussian densities (not necessarily normalized distributions) over the same variable $\mathbf{x}$: $\mathcal{N}(\mathbf{x} \mid \mathbf{a}, \mathbf{A})$ and $\mathcal{N}(\mathbf{x} \mid \mathbf{b}, \mathbf{B})$.

Their product is:

$$\mathcal{N}(\mathbf{x} \mid \mathbf{a}, \mathbf{A})\, \mathcal{N}(\mathbf{x} \mid \mathbf{b}, \mathbf{B}) = c\, \mathcal{N}(\mathbf{x} \mid \mathbf{c}, \mathbf{C})$$

where the new covariance $\mathbf{C}$ and mean $\mathbf{c}$ of the resulting Gaussian are given by:

- New Covariance: $\mathbf{C} = (\mathbf{A}^{-1} + \mathbf{B}^{-1})^{-1}$
- New Mean: $\mathbf{c} = \mathbf{C}(\mathbf{A}^{-1}\mathbf{a} + \mathbf{B}^{-1}\mathbf{b})$
- The term $c = \mathcal{N}(\mathbf{a} \mid \mathbf{b}, \mathbf{A} + \mathbf{B})$ is a scaling constant that does not depend on $\mathbf{x}$. It is the value needed to normalize the new Gaussian.
Interpretation in terms of Information: The inverse of a covariance matrix is known as the precision matrix. The equations show that when multiplying Gaussians, we simply add their precisions. This aligns with the intuition that combining two sources of information results in a more precise (higher precision, lower variance) final belief. The new mean is a precision-weighted average of the original means.
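In 1-D the precision-weighted combination is easy to compute by hand; a sketch with made-up densities:

```python
import numpy as np

# Two 1-D Gaussian densities N(x | a, A) and N(x | b, B)
a, A = 0.0, 4.0     # broad, prior-like density
b, B = 3.0, 1.0     # sharper, likelihood-like density

# Precisions (1/variance) add; the new mean is a precision-weighted average
C = 1.0 / (1.0 / A + 1.0 / B)
c = C * (a / A + b / B)

print(C)   # 1 / (0.25 + 1.0) = 0.8
print(c)   # 0.8 * (0.0 + 3.0) = 2.4
```

The result matches the interpretation above: the combined variance 0.8 is smaller than either input, and the mean 2.4 sits much closer to the sharper density at 3.0 than to the broad one at 0.0.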
Sums and Linear Transformations of Gaussians
Setting
We apply a linear (or more generally, affine) transformation to $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ to produce a new random variable $\mathbf{y}$:

$$\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b}$$

where $\mathbf{A} \in \mathbb{R}^{M \times D}$ and $\mathbf{b} \in \mathbb{R}^M$.
We want to know
What is the distribution of $\mathbf{y}$?
Vineeth tells you what’s up
An affine transformation of a Gaussian random variable is also a Gaussian random variable.
Computing the New Mean: We use the linearity of expectation:

$$\mathbb{E}[\mathbf{y}] = \mathbb{E}[\mathbf{A}\mathbf{x} + \mathbf{b}] = \mathbf{A}\mathbb{E}[\mathbf{x}] + \mathbf{b} = \mathbf{A}\boldsymbol{\mu} + \mathbf{b}$$

Computing the New Covariance: We use the property of covariance under affine transformations:

$$\mathrm{Cov}[\mathbf{y}] = \mathrm{Cov}[\mathbf{A}\mathbf{x} + \mathbf{b}] = \mathbf{A}\,\mathrm{Cov}[\mathbf{x}]\,\mathbf{A}^\top = \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^\top$$

Therefore, the distribution of the transformed variable $\mathbf{y}$ is:

$$\mathbf{y} \sim \mathcal{N}(\mathbf{A}\boldsymbol{\mu} + \mathbf{b},\; \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^\top)$$
Special Case: Sum of Independent Gaussians
If we have two independent Gaussian random variables, $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}_x, \boldsymbol{\Sigma}_x)$ and $\mathbf{y} \sim \mathcal{N}(\boldsymbol{\mu}_y, \boldsymbol{\Sigma}_y)$, their sum $\mathbf{z} = \mathbf{x} + \mathbf{y}$ is also a Gaussian. This is a special case of the linear transformation applied to the stacked vector $(\mathbf{x}, \mathbf{y})$, with $\mathbf{A} = [\mathbf{I} \;\; \mathbf{I}]$ and $\mathbf{b} = \mathbf{0}$. The resulting distribution is:

$$\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}_x + \boldsymbol{\mu}_y,\; \boldsymbol{\Sigma}_x + \boldsymbol{\Sigma}_y)$$

The means add, and because of independence, the variances add.
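A quick sampling check that means and variances add for independent Gaussians (arbitrary parameters):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# Two independent 1-D Gaussians
x = rng.normal(1.0, 2.0, size=n)    # N(1, 4)
y = rng.normal(-0.5, 1.0, size=n)   # N(-0.5, 1)
z = x + y

# Closed form: z ~ N(1 - 0.5, 4 + 1) = N(0.5, 5)
print(z.mean())   # close to 0.5
print(z.var())    # close to 5.0
```

Note that it is the variances that add, not the standard deviations: the standard deviation of `z` is $\sqrt{5} \approx 2.24$, not $2 + 1 = 3$.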
Sources
- Mathematics for Machine Learning (link)