- What’s there?
- Introduction
- Stability under linear transformation
- Now, more introductions
- Marginals and Conditionals
- Product of Gaussian Densities
- Sums and Linear Transformations of Gaussians
- Sources
What’s there?
Started with some Probability and Statistics; doing Gaussians now. This is an info dump.
Why do I write this?
I wanted to create an authoritative document I could refer back to later. Thought that I might as well make it public (build in public, serve society and that sort of stuff).
Introduction
Definition of a Univariate Gaussian
A random variable $X$ is said to follow a Gaussian distribution, written $X \sim \mathcal{N}(\mu, \sigma^2)$, if its PDF is given by the specific functional form:

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
Properties
- Symmetry: The PDF is perfectly symmetric about its mean $\mu$. This implies that the mean, median, and mode of the distribution are all equal.
- The Normalization Constant: The term $\frac{1}{\sqrt{2\pi\sigma^2}}$ is the normalization constant. It ensures that the total area under the curve integrates to 1, as required for any valid PDF: $\int_{-\infty}^{\infty} p(x)\,dx = 1$.
- The Standard Normal Distribution: A special case of the Gaussian with $\mu = 0$ and $\sigma^2 = 1$ is called the standard normal distribution, denoted $\mathcal{N}(0, 1)$. Any Gaussian variable $X \sim \mathcal{N}(\mu, \sigma^2)$ can be transformed into a standard normal variable $Z$ via standardization: $Z = \frac{X - \mu}{\sigma}$.
68-95-99.7 Empirical Rule
I’m just adding this here because it comes up everywhere.
- One Standard Deviation ($\mu \pm 1\sigma$): Approximately 68.27% of the area under the curve lies within one standard deviation of the mean.
- Two Standard Deviations ($\mu \pm 2\sigma$): Approximately 95.45% of the area under the curve lies within two standard deviations of the mean.
- Three Standard Deviations ($\mu \pm 3\sigma$): Approximately 99.73% of the area under the curve lies within three standard deviations of the mean.
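These percentages can be sanity-checked numerically. A minimal sketch using numpy sampling (the seed, sample size, and the particular $\mu$ and $\sigma$ are arbitrary choices; the exact values come from the Gaussian CDF):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 5.0, 2.0
samples = rng.normal(mu, sigma, size=1_000_000)

# Fraction of samples within k standard deviations of the mean
for k in (1, 2, 3):
    frac = np.mean(np.abs(samples - mu) < k * sigma)
    print(f"within {k} sigma: {frac:.4f}")
```

With a million samples the printed fractions land within a fraction of a percent of 0.6827, 0.9545, and 0.9973.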
Stability under linear transformation
The Gaussian distribution is stable under affine transformations. If $X \sim \mathcal{N}(\mu, \sigma^2)$ and $Y = aX + b$ for scalars $a$ and $b$, then $Y$ is also a Gaussian random variable. Its mean and variance are:

$$\mathbb{E}[Y] = a\mu + b, \qquad \mathrm{Var}(Y) = a^2\sigma^2$$

Therefore, $Y \sim \mathcal{N}(a\mu + b,\; a^2\sigma^2)$.
Also, the sum of two independent Gaussian random variables is a Gaussian random variable.
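A quick sampling check of this stability property (the specific values of $\mu$, $\sigma^2$, $a$, $b$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2 = 2.0, 9.0          # X ~ N(2, 9)
a, b = 3.0, -1.0               # Y = aX + b

x = rng.normal(mu, np.sqrt(sigma2), size=1_000_000)
y = a * x + b

# Closed form says Y ~ N(a*mu + b, a^2 * sigma^2) = N(5, 81)
print(y.mean(), a * mu + b)
print(y.var(), a**2 * sigma2)
```

The empirical mean and variance of `y` agree with the closed-form $a\mu + b = 5$ and $a^2\sigma^2 = 81$ to sampling error.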
Now, more introductions
The Multivariate Gaussian Distribution
A random vector $\mathbf{x} \in \mathbb{R}^D$ is said to have a multivariate Gaussian distribution if its Probability Density Function (PDF) is given by:

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$

We use the notation $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$.
Let’s deconstruct this formula –
- Normalization Constant: The term $\frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}}$ normalizes the density. Here $|\boldsymbol{\Sigma}|$ is the determinant of the covariance matrix; it equals the product of the eigenvalues of $\boldsymbol{\Sigma}$ and geometrically measures the volume occupied by the data cloud.
- The Exponential Argument: The term inside the exponent,

$$\Delta^2 = (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})$$

This quadratic form is known as the Mahalanobis distance squared. It measures the distance from a point $\mathbf{x}$ to the mean $\boldsymbol{\mu}$, taking into account the covariance of the data. It is a unitless distance that accounts for the fact that the data cloud may be stretched and rotated.
The Geometry of the MVN
The loci of points with constant probability density (the isocontours) are the sets of $\mathbf{x}$ for which the Mahalanobis distance is constant:

$$(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) = c$$

This is the equation of a hyperellipse centered at $\boldsymbol{\mu}$. The shape and orientation of this hyperellipse are completely determined by the covariance matrix $\boldsymbol{\Sigma}$.
Now, to understand the geometry of the ellipse, do an eigendecomposition –

$$\boldsymbol{\Sigma} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^\top$$

where $\mathbf{U}$ is an orthogonal matrix whose columns $\mathbf{u}_i$ are the eigenvectors of $\boldsymbol{\Sigma}$, and $\boldsymbol{\Lambda}$ is a diagonal matrix of the corresponding non-negative eigenvalues $\lambda_i$.
- Principal Axes: The eigenvectors $\mathbf{u}_i$ of $\boldsymbol{\Sigma}$ define the directions of the principal axes of the hyperellipse.
- Axis Lengths: The eigenvalues $\lambda_i$ determine the spread along these axes. The length of the semi-axis along the direction $\mathbf{u}_i$ is proportional to the square root of the corresponding eigenvalue, $\sqrt{\lambda_i}$.

This means that the MVN is a cloud of points whose main axes of variation are aligned with the eigenvectors of its covariance matrix.
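To make this concrete, here is a small sketch that eigendecomposes an example covariance matrix (the matrix itself is made up for illustration):

```python
import numpy as np

# A 2x2 covariance matrix: correlated and anisotropic
Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])

# Sigma = U Lambda U^T; columns of U are the principal axes of the isocontour ellipse
eigvals, U = np.linalg.eigh(Sigma)

print("eigenvalues:", eigvals)
print("semi-axis lengths proportional to:", np.sqrt(eigvals))
print("principal axes (columns):\n", U)

# Sanity check: the decomposition reconstructs Sigma exactly
assert np.allclose(U @ np.diag(eigvals) @ U.T, Sigma)
```

`np.linalg.eigh` is used (rather than `eig`) because $\boldsymbol{\Sigma}$ is symmetric, which guarantees real eigenvalues and orthonormal eigenvectors.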
Partitioning the vector
We begin with a joint Gaussian distribution over a high-dimensional random vector $\mathbf{x} \in \mathbb{R}^D$. Let’s partition this vector into two disjoint subsets, $\mathbf{x}_a$ and $\mathbf{x}_b$.

The joint distribution is defined by a partitioned mean vector and a partitioned covariance matrix:

$$\boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b \end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{aa} & \boldsymbol{\Sigma}_{ab} \\ \boldsymbol{\Sigma}_{ba} & \boldsymbol{\Sigma}_{bb} \end{pmatrix}$$

where $\boldsymbol{\mu}_a, \boldsymbol{\mu}_b$ are the mean vectors, $\boldsymbol{\Sigma}_{aa}, \boldsymbol{\Sigma}_{bb}$ are the covariance matrices of $\mathbf{x}_a$ and $\mathbf{x}_b$ respectively, and $\boldsymbol{\Sigma}_{ab} = \boldsymbol{\Sigma}_{ba}^\top$ is the cross-covariance matrix.
We can now ask two questions:

- Marginalization: What is the distribution of one subset, $\mathbf{x}_a$, if we have no information about the other?
- Conditioning: What is the distribution of one subset, $\mathbf{x}_a$, if we observe the value of the other?
Marginals and Conditionals
Marginals
To find the marginal distribution $p(\mathbf{x}_a)$, we must integrate out (or “marginalize”) the other variable, $\mathbf{x}_b$, from the joint distribution:

$$p(\mathbf{x}_a) = \int p(\mathbf{x}_a, \mathbf{x}_b)\, d\mathbf{x}_b$$

The marginal distribution of a joint Gaussian is also a Gaussian.

The result of this integration is remarkably simple. We effectively just “read off” the corresponding parts from the joint distribution’s parameters:

$$\mathbf{x}_a \sim \mathcal{N}(\boldsymbol{\mu}_a, \boldsymbol{\Sigma}_{aa})$$
Intuition: If you have a 2D Gaussian “mountain,” its shadow projected onto the x-axis (integrating out y) is a 1D Gaussian bell curve.
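A small numerical illustration of reading off the marginal (the joint parameters here are made up):

```python
import numpy as np

rng = np.random.default_rng(3)

# Joint 2-D Gaussian over (x, y)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.5]])

samples = rng.multivariate_normal(mu, Sigma, size=500_000)

# Marginal of x: just read off mu[0] and Sigma[0, 0] -- no integration needed
x = samples[:, 0]
print(x.mean())   # close to mu[0] = 1.0
print(x.var())    # close to Sigma[0, 0] = 2.0
```

Ignoring the `y` column of the samples is exactly the "shadow on the x-axis" picture: the empirical mean and variance of `x` match $\mu_a$ and $\Sigma_{aa}$.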
Conditionals
Remember that $p(\mathbf{x}_a \mid \mathbf{x}_b) = \dfrac{p(\mathbf{x}_a, \mathbf{x}_b)}{p(\mathbf{x}_b)}$.
The conditional distribution of a joint Gaussian is also a Gaussian.
The derivation involves algebraic manipulation of the exponents of the Gaussian PDFs (completing the square).
Without going too much into the maths, I’ll offer the following results –
- Conditional Mean: $\boldsymbol{\mu}_{a|b} = \boldsymbol{\mu}_a + \boldsymbol{\Sigma}_{ab} \boldsymbol{\Sigma}_{bb}^{-1} (\mathbf{x}_b - \boldsymbol{\mu}_b)$
- Conditional Covariance: $\boldsymbol{\Sigma}_{a|b} = \boldsymbol{\Sigma}_{aa} - \boldsymbol{\Sigma}_{ab} \boldsymbol{\Sigma}_{bb}^{-1} \boldsymbol{\Sigma}_{ba}$
Interpretation:
- The conditional mean is a linear function of the observed variable $\mathbf{x}_b$. It starts at the prior mean $\boldsymbol{\mu}_a$ and is adjusted based on how surprising the observation $\mathbf{x}_b$ is (i.e., its deviation from its own mean, $\mathbf{x}_b - \boldsymbol{\mu}_b$). The adjustment is scaled by the correlation term $\boldsymbol{\Sigma}_{ab} \boldsymbol{\Sigma}_{bb}^{-1}$.
- The conditional covariance is independent of the observation $\mathbf{x}_b$. Observing data reduces our uncertainty. The new covariance is the prior covariance $\boldsymbol{\Sigma}_{aa}$ minus a positive semi-definite term, $\boldsymbol{\Sigma}_{ab} \boldsymbol{\Sigma}_{bb}^{-1} \boldsymbol{\Sigma}_{ba}$, that represents the information gained from the observation.
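These two formulas translate directly into numpy. A sketch, where `condition` is a hypothetical helper and the joint parameters are made up for illustration:

```python
import numpy as np

# Joint Gaussian over (x_a, x_b), each 1-D here for simplicity
mu = np.array([0.0, 1.0])                 # [mu_a, mu_b]
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])            # [[S_aa, S_ab], [S_ba, S_bb]]

def condition(mu, Sigma, ia, ib, xb):
    """Parameters of p(x_a | x_b = xb) for a partitioned joint Gaussian."""
    S_aa = Sigma[np.ix_(ia, ia)]
    S_ab = Sigma[np.ix_(ia, ib)]
    S_bb = Sigma[np.ix_(ib, ib)]
    K = S_ab @ np.linalg.inv(S_bb)        # gain that scales the "surprise"
    mu_cond = mu[ia] + K @ (xb - mu[ib])
    Sigma_cond = S_aa - K @ S_ab.T        # S_aa - S_ab S_bb^{-1} S_ba
    return mu_cond, Sigma_cond

mu_c, Sigma_c = condition(mu, Sigma, ia=[0], ib=[1], xb=np.array([2.0]))
print(mu_c)      # 0 + (0.8/1.0)*(2.0 - 1.0) = [0.8]
print(Sigma_c)   # 2.0 - 0.8*0.8/1.0 = [[1.36]]
```

Note that the observation `xb = 2.0` (one standard deviation above its own mean) pulls the mean of $x_a$ upward, and the conditional variance 1.36 is smaller than the prior variance 2.0.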
Product of Gaussian Densities
When applying Bayes’ theorem, we often need to multiply a Gaussian likelihood by a Gaussian prior. This product is also related to the Gaussian family.
Bayesian Inference Context
In a Bayesian setting, the posterior is proportional to the likelihood times the prior:

$$p(\boldsymbol{\theta} \mid \mathcal{D}) \propto p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})$$
Yeah, “posterior” is just a fancy term for the updated belief.
What this is
The product of two Gaussian PDFs is an unnormalized Gaussian PDF.
Let’s consider two Gaussian densities (not necessarily normalized distributions) over the same variable $\mathbf{x}$: $\mathcal{N}(\mathbf{x} \mid \mathbf{a}, \mathbf{A})$ and $\mathcal{N}(\mathbf{x} \mid \mathbf{b}, \mathbf{B})$.

Their product is:

$$\mathcal{N}(\mathbf{x} \mid \mathbf{a}, \mathbf{A})\, \mathcal{N}(\mathbf{x} \mid \mathbf{b}, \mathbf{B}) = c\, \mathcal{N}(\mathbf{x} \mid \mathbf{c}, \mathbf{C})$$

where the new covariance $\mathbf{C}$ and mean $\mathbf{c}$ of the resulting Gaussian are given by:

- New Covariance: $\mathbf{C} = (\mathbf{A}^{-1} + \mathbf{B}^{-1})^{-1}$
- New Mean: $\mathbf{c} = \mathbf{C}(\mathbf{A}^{-1}\mathbf{a} + \mathbf{B}^{-1}\mathbf{b})$
- The term $c = \mathcal{N}(\mathbf{a} \mid \mathbf{b}, \mathbf{A} + \mathbf{B})$ is a scaling constant that does not depend on $\mathbf{x}$. It is the value needed to normalize the new Gaussian.
Interpretation in terms of Information: The inverse of a covariance matrix is known as the precision matrix. The equations show that when multiplying Gaussians, we simply add their precisions. This aligns with the intuition that combining two sources of information results in a more precise (higher precision, lower variance) final belief. The new mean is a precision-weighted average of the original means.
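In 1-D the precision-weighted combination is easy to compute by hand; a sketch with made-up densities:

```python
import numpy as np

# Two 1-D Gaussian densities N(x | a, A) and N(x | b, B)
a, A = 0.0, 4.0     # broad, prior-like density
b, B = 3.0, 1.0     # sharper, likelihood-like density

# Precisions (1/variance) add; the new mean is a precision-weighted average
C = 1.0 / (1.0 / A + 1.0 / B)
c = C * (a / A + b / B)

print(C)   # 1 / (0.25 + 1.0) = 0.8
print(c)   # 0.8 * (0.0 + 3.0) = 2.4
```

The result matches the interpretation above: the combined variance 0.8 is smaller than either input, and the mean 2.4 sits much closer to the sharper density at 3.0 than to the broad one at 0.0.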
Sums and Linear Transformations of Gaussians
Setting
We apply a linear (or more generally, affine) transformation to $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ to produce a new random variable $\mathbf{y}$:

$$\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b}$$

where $\mathbf{A} \in \mathbb{R}^{M \times D}$ and $\mathbf{b} \in \mathbb{R}^M$.
We want to know
What is the distribution of $\mathbf{y}$?
Vineeth tells you what’s up
An affine transformation of a Gaussian random variable is also a Gaussian random variable.
Computing the New Mean: We use the linearity of expectation:

$$\mathbb{E}[\mathbf{y}] = \mathbb{E}[\mathbf{A}\mathbf{x} + \mathbf{b}] = \mathbf{A}\mathbb{E}[\mathbf{x}] + \mathbf{b} = \mathbf{A}\boldsymbol{\mu} + \mathbf{b}$$

Computing the New Covariance: We use the property of covariance under affine transformations:

$$\mathrm{Cov}[\mathbf{y}] = \mathrm{Cov}[\mathbf{A}\mathbf{x} + \mathbf{b}] = \mathbf{A}\,\mathrm{Cov}[\mathbf{x}]\,\mathbf{A}^\top = \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^\top$$

Therefore, the distribution of the transformed variable $\mathbf{y}$ is:

$$\mathbf{y} \sim \mathcal{N}(\mathbf{A}\boldsymbol{\mu} + \mathbf{b},\; \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^\top)$$
Special Case: Sum of Independent Gaussians
If we have two independent Gaussian random variables, $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}_x, \boldsymbol{\Sigma}_x)$ and $\mathbf{y} \sim \mathcal{N}(\boldsymbol{\mu}_y, \boldsymbol{\Sigma}_y)$, their sum $\mathbf{z} = \mathbf{x} + \mathbf{y}$ is also a Gaussian. This is a special case of the linear transformation applied to the stacked vector $(\mathbf{x}, \mathbf{y})$, with $\mathbf{A} = [\mathbf{I} \;\; \mathbf{I}]$ and $\mathbf{b} = \mathbf{0}$. The resulting distribution is:

$$\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}_x + \boldsymbol{\mu}_y,\; \boldsymbol{\Sigma}_x + \boldsymbol{\Sigma}_y)$$

The means add, and because of independence, the variances add.
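A quick sampling check that means and variances add for independent Gaussians (arbitrary parameters):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# Two independent 1-D Gaussians
x = rng.normal(1.0, 2.0, size=n)    # N(1, 4)
y = rng.normal(-0.5, 1.0, size=n)   # N(-0.5, 1)
z = x + y

# Closed form: z ~ N(1 - 0.5, 4 + 1) = N(0.5, 5)
print(z.mean())   # close to 0.5
print(z.var())    # close to 5.0
```

Note that it is the variances that add, not the standard deviations: the standard deviation of `z` is $\sqrt{5} \approx 2.24$, not $2 + 1 = 3$.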
Sources
- Mathematics for Machine Learning (link)