- What’s there?
- Axioms of Probability
- Random Variables
- Conditional Probability, Independence, and the Chain Rule
- Bayes’ Theorem
- Expectation, Variance and Moments
- Covariance and correlation
- Law of Large Numbers (LLN)
- Central Limit Theorem (CLT)
- Sources
What’s there?
I want to cover like some basics of probability before I get into Gaussians. Saliently, this covers LLN and CLT.
Why do I write this?
I wanted to create an authoritative document I could refer to for later. Thought that I might as well make it public (build in public, serve society and that sort of stuff).
Axioms of Probability
Imma just drone on about theory in this post.
Sample Space ($\Omega$): The sample space is the set of all possible elementary outcomes of a random experiment. The outcomes must be mutually exclusive and the set must be exhaustive.
Event Space ($\mathcal{F}$): An event is a subset of the sample space $\Omega$. The event space, denoted $\mathcal{F}$, is a collection of events to which we can assign probabilities. It is not always the set of all possible subsets (the power set), but it must be a $\sigma$-algebra, which is a collection of subsets of $\Omega$ that satisfies three properties –
- The empty set is included: $\emptyset \in \mathcal{F}$.
- Closure under complementation: If an event $A \in \mathcal{F}$, then its complement $A^c$ is also in $\mathcal{F}$.
- Closure under countable unions: If $A_1, A_2, \dots$ is a countable sequence of events in $\mathcal{F}$, then their union $\bigcup_{i=1}^{\infty} A_i$ is also in $\mathcal{F}$.
Probability Measure ($P$): The probability measure is a function $P: \mathcal{F} \to [0, 1]$ that maps an event in the event space $\mathcal{F}$ to a real number. This function must satisfy the Kolmogorov Axioms –
- Non-negativity: The probability of any event is non-negative: $P(A) \ge 0$.
- Normalization (Unit Measure): The probability of the entire sample space is 1: $P(\Omega) = 1$. This means that some outcome is guaranteed to occur.
- Countable Additivity (or $\sigma$-additivity): For any countable sequence of pairwise disjoint events $A_1, A_2, \dots$ in $\mathcal{F}$ (i.e., $A_i \cap A_j = \emptyset$ for $i \neq j$), the probability of their union is the sum of their individual probabilities: $P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$.
Random Variables
A random variable $X$ is a function $X: \Omega \to \mathbb{R}$ that assigns a real number to every possible outcome $\omega \in \Omega$.
The key technical requirement is that this function must be measurable, which ensures that sets of the form $\{\omega \in \Omega : X(\omega) \le x\}$ are valid events in $\mathcal{F}$, allowing us to assign probabilities to them.
PDF and PMF
The behavior of a discrete random variable is described by its PMF, $p_X(x)$, which gives the probability of the variable taking on a specific value $x$:
$p_X(x) = P(X = x)$
The PMF satisfies $p_X(x) \ge 0$ and $\sum_x p_X(x) = 1$.
The probability that a continuous random variable takes on any single specific value is zero. Instead, we describe its behavior with a PDF, $f_X(x)$. The probability of the variable falling within an interval $[a, b]$ is the integral of the PDF over that interval:
$P(a \le X \le b) = \int_a^b f_X(x)\,dx$
The PDF satisfies $f_X(x) \ge 0$ and $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$. Note that, unlike a PMF, a PDF can take values greater than 1.
Also gonna throw this in here
A random variable is mixed if it has properties of both discrete and continuous variables.
Its probability distribution is a mixture of a PMF (for the discrete parts) and a PDF (for the continuous parts).
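The PMF and PDF properties above can be sanity-checked in a few lines. This is a minimal sketch using two hypothetical examples: a fair die's PMF, and a Uniform(0, 0.5) density, which exceeds 1 while still integrating to 1.

```python
# PMF of a fair six-sided die: p(x) = 1/6 for x in {1, ..., 6}
pmf = {x: 1 / 6 for x in range(1, 7)}
assert abs(sum(pmf.values()) - 1.0) < 1e-12  # probabilities sum to 1
assert all(p >= 0 for p in pmf.values())     # non-negativity

# PDF of a Uniform(0, 0.5) variable: f(x) = 2 on [0, 0.5], 0 elsewhere.
# The density value is 2 > 1, yet the total area under it is still 1.
def uniform_pdf(x, a=0.0, b=0.5):
    return 1.0 / (b - a) if a <= x <= b else 0.0

print(uniform_pdf(0.25))  # 2.0 -- a valid density value greater than 1
```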
Conditional Probability, Independence, and the Chain Rule
Conditional Probability: The conditional probability of event $A$ occurring, given that event $B$ has already occurred, is denoted $P(A \mid B)$. It represents an update to our belief about $A$ in light of the new information that $B$ is true. It is defined as:
$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$, provided $P(B) > 0$.
Geometrically, this is a re-normalization of the probability space: we restrict our attention to the outcomes within event $B$ and re-normalize their probabilities to sum to one.
Statistical Independence: Two events $A$ and $B$ are statistically independent if the occurrence of one does not provide any information about the occurrence of the other. Mathematically (this is all I care about rn ngl), this means:
$P(A \cap B) = P(A)\,P(B)$
Substituting this into the definition of conditional probability yields the more common test for independence:
$P(A \mid B) = P(A)$
Two random variables $X$ and $Y$ are independent if this property holds for all possible values they can take: $P(X = x, Y = y) = P(X = x)\,P(Y = y)$ for all $x$ and $y$.
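Both forms of the independence test can be verified numerically. A quick sketch on a hypothetical sample space of two fair coin flips:

```python
# Sample space: two independent fair coin flips, each outcome has prob 1/4.
outcomes = [("H", "H"), ("H", "T"), ("T", "H"), ("T", "T")]
P = {o: 0.25 for o in outcomes}

A = {o for o in outcomes if o[0] == "H"}  # event: first flip is heads
B = {o for o in outcomes if o[1] == "H"}  # event: second flip is heads

def prob(event):
    return sum(P[o] for o in event)

p_a, p_b = prob(A), prob(B)
p_ab = prob(A & B)

# Product test: P(A ∩ B) == P(A) * P(B)
assert abs(p_ab - p_a * p_b) < 1e-12
# Equivalent conditional test: P(A | B) == P(A)
assert abs(p_ab / p_b - p_a) < 1e-12
```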
Chain Rule
In the language of random variables, for a set of variables $X_1, \dots, X_n$, the chain rule allows us to express their joint distribution as:
$P(X_1, \dots, X_n) = P(X_1)\,P(X_2 \mid X_1)\,P(X_3 \mid X_1, X_2) \cdots P(X_n \mid X_1, \dots, X_{n-1})$
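The factorization can be checked mechanically. This sketch builds a tiny made-up joint distribution over three binary variables (the weights are arbitrary, purely illustrative) and verifies that the chain rule reproduces every joint probability:

```python
import itertools

# Arbitrary unnormalized weights over the 8 outcomes of (x, y, z).
joint = {}
raw = [3, 1, 2, 2, 1, 4, 2, 1]
for w, (x, y, z) in zip(raw, itertools.product([0, 1], repeat=3)):
    joint[(x, y, z)] = w / sum(raw)

def marginal(fixed):
    """Sum the joint over all entries whose leading values match `fixed`."""
    return sum(p for k, p in joint.items() if k[:len(fixed)] == fixed)

# Chain rule: P(x, y, z) = P(x) * P(y | x) * P(z | x, y)
for (x, y, z), p_xyz in joint.items():
    p_x = marginal((x,))
    p_y_given_x = marginal((x, y)) / p_x
    p_z_given_xy = p_xyz / marginal((x, y))
    assert abs(p_xyz - p_x * p_y_given_x * p_z_given_xy) < 1e-12
```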
Bayes’ Theorem
Bayes’ theorem provides a formal mechanism for inverting conditional probabilities.
Terminology
- Hypothesis ($H$): A proposition about the world whose truth we are uncertain about (e.g., “The patient has the disease,” “This email is spam”).
- Evidence ($E$): A new piece of data or observation that is relevant to the hypothesis.
- Prior ($P(H)$): Our initial belief in the hypothesis before observing the evidence.
- Likelihood ($P(E \mid H)$): The probability of observing the evidence if the hypothesis were true. This is the forward, generative model.
- Posterior ($P(H \mid E)$): Our updated belief in the hypothesis after observing the evidence. This is the quantity we want to compute.
The theorem itself
$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$
- The Marginal Likelihood (or Evidence): The denominator, $P(E)$, is the total probability of observing the evidence, averaged over all possible hypotheses. It is computed using the law of total probability: $P(E) = \sum_i P(E \mid H_i)\,P(H_i)$. It acts as a normalization constant, ensuring that the posterior probabilities sum to one.
The theorem can be written in a more memorable form using the terms:
$\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}$
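A worked example makes the update concrete. The numbers here are made up for illustration: a disease with 1% prevalence and a test that fires positive 95% of the time when the patient is sick and 5% of the time when they are healthy.

```python
p_h = 0.01               # P(H): prior -- 1% of the population is sick
p_e_given_h = 0.95       # P(E|H): likelihood of a positive test if sick
p_e_given_not_h = 0.05   # P(E|~H): false positive rate

# Marginal likelihood via the law of total probability:
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# Posterior via Bayes' theorem:
p_h_given_e = p_e_given_h * p_h / p_e
print(round(p_h_given_e, 3))  # 0.161
```

Even after a positive test, the posterior is only about 16%, because the prior is so low. This is exactly the kind of inversion Bayes' theorem formalizes.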
Expectation, Variance and Moments
Imma just skip the basic stuff.
This is the Expectation of a Function –
$E[g(X)] = \sum_x g(x)\,p_X(x)$ (discrete), $\quad E[g(X)] = \int_{-\infty}^{\infty} g(x)\,f_X(x)\,dx$ (continuous)
Moments
Moments about the Origin: The $k$-th moment of a random variable $X$ about the origin is defined as the expected value of $X^k$:
$\mu_k' = E[X^k]$
The first moment, $\mu_1' = E[X]$, is the mean.
Central Moments (Moments about the Mean): The $k$-th central moment is the expected value of the $k$-th power of the deviation from the mean, $X - \mu$:
$\mu_k = E[(X - \mu)^k]$
Random facts about central moments
- The first central moment is always zero: $E[X - \mu] = 0$.
- The second central moment, $E[(X - \mu)^2]$, is the variance. It is the most important measure of the spread or dispersion of the distribution. It is denoted $\sigma^2$ or $\text{Var}(X)$. The standard deviation, $\sigma$, is the square root of the variance, returned to the original units of the variable.
- The third central moment (normalized) is related to the skewness, which measures the asymmetry of the distribution.
- The fourth central moment (normalized) is related to the kurtosis, which measures the “tailedness” or propensity for outliers.
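The definitions above reduce to a single `expect` helper for discrete distributions. A minimal sketch, using a fair six-sided die as the example:

```python
# PMF of a fair die: p(x) = 1/6 for x in {1, ..., 6}
pmf = {x: 1 / 6 for x in range(1, 7)}

def expect(g):
    """E[g(X)] for a discrete PMF."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expect(lambda x: x)                  # first raw moment: 3.5
var = expect(lambda x: (x - mean) ** 2)     # second central moment
first_central = expect(lambda x: x - mean)  # always zero

assert abs(mean - 3.5) < 1e-12
assert abs(var - 35 / 12) < 1e-12           # Var of a fair die is 35/12
assert abs(first_central) < 1e-12
```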
Covariance and correlation
We’ve discussed covariance in PCA, Spectral Clustering and t-SNE.
Now, notice that the magnitude of the covariance is difficult to interpret because it depends on the units and variances of the individual variables.
A covariance of 100 might be very strong for one pair of variables but very weak for another.
To create a standardized, unit-free measure of the linear relationship, we normalize the covariance by the standard deviations of the two variables. This gives the Pearson correlation coefficient, $\rho_{XY}$ –
$\rho_{XY} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$
- $\rho_{XY} = +1$: Perfect positive linear relationship.
- $\rho_{XY} = -1$: Perfect negative linear relationship.
- $\rho_{XY} = 0$: No linear relationship (uncorrelated).
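Computing the coefficient by hand makes the normalization obvious. A sketch on a small made-up dataset where $y = 2x$ exactly:

```python
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]  # y = 2x: perfectly linear

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Covariance and (population) standard deviations from their definitions.
cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
sx = math.sqrt(sum((xi - mx) ** 2 for xi in x) / n)
sy = math.sqrt(sum((yi - my) ** 2 for yi in y) / n)

rho = cov / (sx * sy)
print(rho)  # 1.0 -- perfect positive linear relationship
```

Note that `cov` here is 4.0 in whatever units $x \cdot y$ carries, which on its own says little; only after dividing by `sx * sy` does the value become comparable across datasets.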
Law of Large Numbers (LLN)
It says that the average of a large number of independent samples from a distribution converges to the theoretical expected value of that distribution.
It is the theorem that guarantees that sampling works.
Theoretical Development: Forms of the Law
Weak Law of Large Numbers: This law states that for any arbitrarily small positive number $\epsilon$, the probability that the sample mean deviates from the true mean by more than $\epsilon$ approaches zero as the sample size $n$ goes to infinity:
$\lim_{n \to \infty} P(|\bar{X}_n - \mu| > \epsilon) = 0$
This is also called convergence in probability. It doesn’t guarantee that for a single long experiment the average will be close to the mean, but that it is overwhelmingly likely to be.
Strong Law of Large Numbers: This law makes a much stronger statement. It asserts that, with probability 1, the sample mean will converge to the true mean:
$P\left(\lim_{n \to \infty} \bar{X}_n = \mu\right) = 1$
This is also called almost sure convergence. This guarantees that for a single, infinitely long experiment, the sample average will eventually settle down at the true population mean.
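The weak law is easy to watch in action. A sketch that averages fair-coin flips (seeded so the run is reproducible):

```python
import random

random.seed(0)

# The sample mean of Bernoulli(0.5) flips should drift toward 0.5 as n grows.
for n in [10, 1_000, 100_000]:
    flips = [random.random() < 0.5 for _ in range(n)]
    mean = sum(flips) / n
    print(n, mean)

# The deviation |mean - 0.5| shrinks as n grows, as the weak law predicts.
```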
Why is this important in ML?
The LLN is the theoretical justification for Empirical Risk Minimization.
When we minimize a loss function on a finite training set, we are computing an empirical average. The LLN gives us confidence that, with enough data, this empirical risk will be a good approximation of the true expected risk over the entire data distribution.
Central Limit Theorem (CLT)
Foundations
- The Question: The LLN tells us where the sample mean $\bar{X}_n$ is going (it converges to $\mu$). The CLT tells us how it gets there: it describes the shape of the probability distribution of the sample mean around the true mean for a large but finite $n$.
- The Setup: Let $X_1, X_2, \dots$ be a sequence of i.i.d. random variables with finite mean $\mu$ and finite variance $\sigma^2$.
Theoretical Development
Theorem (Central Limit Theorem): As the sample size $n$ becomes large, the distribution of the sample mean $\bar{X}_n$ approaches a normal (Gaussian) distribution with mean $\mu$ and variance $\sigma^2 / n$.
More formally, the standardized version of the sample mean converges in distribution to a standard normal distribution:
$Z_n = \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1)$
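The convergence can be demonstrated empirically. This sketch standardizes means of Uniform(0, 1) samples (which have $\mu = 0.5$ and $\sigma^2 = 1/12$) and checks that the resulting $Z_n$ values have roughly zero mean and unit variance; the sample sizes are arbitrary choices.

```python
import math
import random

random.seed(0)
mu, sigma = 0.5, math.sqrt(1 / 12)  # moments of Uniform(0, 1)

n = 50            # samples averaged per trial
trials = 20_000   # number of sample means drawn

zs = []
for _ in range(trials):
    xbar = sum(random.random() for _ in range(n)) / n
    zs.append((xbar - mu) / (sigma / math.sqrt(n)))  # standardize

# If the CLT holds, the z-values should behave like N(0, 1) draws.
z_mean = sum(zs) / trials
z_var = sum((z - z_mean) ** 2 for z in zs) / trials
print(round(z_mean, 2), round(z_var, 2))
```

A histogram of `zs` would show the familiar bell curve, even though each underlying sample is uniform, not Gaussian.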
Why’s this so cool?
- The CLT is incredibly powerful because it holds true regardless of the shape of the original distribution of the $X_i$. Whether the individual variables are from a uniform, Bernoulli, exponential, or any other well-behaved distribution, the distribution of their average will always tend towards a Gaussian bell curve.
- This theorem explains the ubiquity of the Gaussian distribution in nature and science.
Many complex phenomena can be modeled as the sum or average of many small, independent random effects. According to the CLT, the distribution of this aggregate phenomenon will be approximately Gaussian, even if the individual effects are not. For example, measurement error in an instrument is often the sum of many small, independent sources of error, so it is well-modeled by a Gaussian.
Basically, Gaussians are common and this can be exploited.
Sources
- Mathematics for Machine Learning (link)