- What’s there?
- Bernoulli and Binomial Distributions
- Poisson Distribution
- Categorical and Multinomial Distributions
- Sources
What’s there?
Like Foundations of Probability, this is a quick pass over a few more basics before we get to Gaussians.
Why do I write this?
I wanted to create an authoritative document I could refer to for later. Thought that I might as well make it public (build in public, serve society and that sort of stuff).
Bernoulli and Binomial Distributions
Why together?
They form the basis for modeling any experiment with a binary outcome.
Foundation – The Single Trial
Bernoulli Trial
A single experiment with exactly two possible, mutually exclusive outcomes. These outcomes are generically labeled “success” and “failure”.
Bernoulli Distribution
A random variable $X$ is said to follow a Bernoulli distribution if it is the outcome of a single Bernoulli trial.
The distribution is governed by a single parameter, $p \in [0, 1]$, which represents the probability of success.
Probability Mass Function (PMF): The PMF of a Bernoulli random variable is
$$P(X = 1) = p, \qquad P(X = 0) = 1 - p.$$
This can be written more compactly as
$$P(X = x) = p^x (1 - p)^{1 - x}, \quad x \in \{0, 1\}.$$
Moments:
- Expectation: $\mathbb{E}[X] = p$
- Variance: $\mathrm{Var}[X] = p(1 - p)$
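To make this concrete, here is a minimal sketch (assuming NumPy; the value of $p$ is just an example) that samples Bernoulli trials and checks that the sample mean and variance land near $p$ and $p(1 - p)$:

```python
import numpy as np

rng = np.random.default_rng(42)
p = 0.3                                          # probability of success (example value)

# A Bernoulli variable is just a Binomial with n = 1.
samples = rng.binomial(n=1, p=p, size=100_000)

print(samples.mean())    # ~0.30, matching E[X] = p
print(samples.var())     # ~0.21, matching Var[X] = p(1 - p)
```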
Foundation – Sum of Trials
Binomial Experiment
The Binomial distribution arises from extending the Bernoulli trial. It describes the outcome of an experiment that satisfies the following four conditions:
1. The experiment consists of a fixed number of trials, $n$.
2. Each trial is independent of the others.
3. Each trial has only two possible outcomes (success/failure).
4. The probability of success, $p$, is the same for each trial.
Binomial Distribution
A random variable $X$ is said to follow a Binomial distribution if it represents the total number of successes in $n$ independent Bernoulli trials. It is governed by two parameters: the number of trials $n$ and the probability of success $p$. We write $X \sim \mathrm{Bin}(n, p)$.
Probability Mass Function (PMF): To find the probability of observing exactly $k$ successes in $n$ trials, we need two components:
- The probability of any specific sequence of $k$ successes and $n - k$ failures is $p^k (1 - p)^{n - k}$, due to independence.
- The number of distinct ways to arrange $k$ successes among $n$ trials is given by the binomial coefficient $\binom{n}{k} = \frac{n!}{k!\,(n - k)!}$.

Combining these gives the PMF:
$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}, \quad k = 0, 1, \dots, n.$$
Moments: The Binomial random variable can be seen as the sum of $n$ independent Bernoulli random variables, $X = \sum_{i=1}^{n} X_i$, where each $X_i \sim \mathrm{Bernoulli}(p)$. Using the linearity of expectation and the property that the variance of a sum of independent variables is the sum of their variances:
- Expectation: $\mathbb{E}[X] = np$
- Variance: $\mathrm{Var}[X] = np(1 - p)$
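As a quick sanity check of the PMF and the sum-of-Bernoullis view, here is a sketch assuming NumPy and SciPy (the values of $n$, $p$, and $k$ are arbitrary examples):

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
n, p = 10, 0.3

# Build Binomial samples explicitly as sums of n independent Bernoulli trials.
bernoulli_trials = rng.binomial(1, p, size=(50_000, n))
successes = bernoulli_trials.sum(axis=1)

k = 3
print(binom.pmf(k, n, p))                 # exact P(X = 3) from the formula
print(np.mean(successes == k))            # simulated frequency, should be close
print(successes.mean(), n * p)            # E[X] = np
print(successes.var(), n * p * (1 - p))   # Var[X] = np(1 - p)
```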
Poisson Distribution
The Poisson distribution can be derived as the limit of the Binomial distribution $\mathrm{Bin}(n, p)$ as the number of trials $n$ goes to infinity while the probability of success $p$ goes to zero, in such a way that their product remains constant: $np = \lambda$.
Huh?
Imagine dividing a one-hour interval into one-second subintervals. Let the "success" be an event (e.g., a customer arriving) occurring in a subinterval. The probability $p$ of success in any one-second subinterval is very small, but the number of trials $n$ is very large. In this limit, the Binomial PMF converges to the Poisson PMF.
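Here is a small numerical illustration of that limit, a sketch assuming SciPy: fix $\lambda = np$ and let $n$ grow while $p = \lambda / n$ shrinks, and the Binomial PMF lines up with the Poisson PMF.

```python
import numpy as np
from scipy.stats import binom, poisson

lam = 4.0                  # fixed lambda = n * p (example value)
ks = np.arange(15)

for n in (10, 100, 10_000):
    p = lam / n            # p shrinks as n grows, keeping n * p constant
    max_gap = np.max(np.abs(binom.pmf(ks, n, p) - poisson.pmf(ks, lam)))
    print(n, max_gap)      # the gap shrinks toward 0 as n grows
```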
Properties
Probability Mass Function (PMF):
$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots$$
The range of a Poisson variable is countably infinite. The term $e^{-\lambda}$ is a normalization constant that ensures the probabilities sum to 1.
Moments:
- Expectation: $\mathbb{E}[X] = \lambda$
- Variance: $\mathrm{Var}[X] = \lambda$
When is this even used? Predicting counts.
What if we want to predict the count of things? That is, a target $y \in \{0, 1, 2, \dots\}$.
When the target variable in a regression problem is a count (e.g., number of purchases a customer makes in a month, number of clicks on an ad), standard linear regression is inappropriate.
Why?
A linear regression model doesn’t know the target variable is a count. It can easily predict negative values (e.g., -2 purchases) or fractional values (e.g., 4.7 clicks), which are nonsensical in this context.
Moreover, and this is important,
linear regression assumes that the variability (variance) of the data points is the same regardless of their predicted value. Huh?
What matters here is a statistical property called heteroscedasticity, where the variability of a variable is unequal across the range of values of a second variable that predicts it.
In simpler terms, the data’s “spread” or “scatter” is not constant!
Standard linear regression assumes homoscedasticity. This means it expects the errors (the difference between the actual data points and the regression line) to have the same variance at all levels of the predictor variable.
Think of it as a consistent “error band” around the regression line. The amount of scatter should be roughly the same for small predicted values as it is for large ones.
Count data is often heteroscedastic. The variance tends to increase as the mean count increases. For instance, consider the example of daily website visitors –
- Small Personal Blog: The average number of visitors might be 10 per day. The daily count probably won’t vary much, maybe staying within a tight range like 5 to 15. The variance is low.
- Popular Website: The average might be 10,000 visitors per day. On a slow day, it might get 8,000, and on a viral day, it might get 15,000. The range of possible values is much wider, so the variance is high.
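You can see this variance-grows-with-the-mean behaviour directly by simulating Poisson counts at the two traffic levels from the example above (a sketch assuming NumPy; real traffic data is usually even more over-dispersed than a Poisson, but the direction of the effect is the same):

```python
import numpy as np

rng = np.random.default_rng(1)

for mean_visitors in (10, 10_000):
    daily_counts = rng.poisson(mean_visitors, size=365)   # a year of daily counts
    print(mean_visitors, daily_counts.var())               # sample variance grows with the mean
```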
What do we do instead?
Poisson regression: the response variable is assumed to follow a Poisson distribution, and the logarithm of its mean parameter is modeled as a linear function of the features.
- This distribution is defined only for non-negative integers, so the model’s underlying assumption matches the nature of the data.
- Also, a key property of the Poisson distribution is that its mean is equal to its variance, which naturally handles the issue of non-constant variance.
How does it work?
Instead of modeling the mean count directly, Poisson regression models the logarithm of the mean as a linear combination of the features: $\log \lambda = \mathbf{w}^\top \mathbf{x}$.
Using the log ensures that the predicted mean, $\lambda = \exp(\mathbf{w}^\top \mathbf{x})$, will always be positive, thus preventing the model from making impossible negative predictions. This logarithmic transformation is called the link function.
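Here is a minimal sketch of that idea using scikit-learn's PoissonRegressor, which uses the log link by default (the synthetic feature and coefficients below are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(0)

# One synthetic feature; the true log-mean is linear in x.
X = rng.uniform(0, 2, size=(500, 1))
lam = np.exp(0.5 + 1.2 * X[:, 0])      # true mean, always positive
y = rng.poisson(lam)                   # observed counts

model = PoissonRegressor(alpha=0.0)    # no regularization, for clarity
model.fit(X, y)

print(model.intercept_, model.coef_)   # roughly 0.5 and 1.2
print(model.predict(X[:5]))            # predicted means, never negative
```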
Categorical and Multinomial Distributions
These are the direct generalizations of the Bernoulli and Binomial distributions to experiments with more than two possible outcomes.
Foundation – The Single Multi-Class Trial
Categorical Trial:
This is a single experiment with $K$ possible, mutually exclusive outcomes (or "categories").
Categorical Distribution:
A random variable $X$ follows a Categorical distribution if it is the outcome of a single such trial.
The distribution is governed by a parameter vector $\boldsymbol{\theta} = (\theta_1, \dots, \theta_K)$, where $\theta_k$ is the probability of the $k$-th outcome, and $\sum_{k=1}^{K} \theta_k = 1$.
- Representation: The outcome is typically represented using one-hot encoding. A trial resulting in category $k$ is represented by a $K$-dimensional vector $\mathbf{x} = (0, \dots, 0, 1, 0, \dots, 0)$, where the 1 is in the $k$-th position.
- Probability Mass Function (PMF): For a one-hot encoded vector $\mathbf{x}$:
$$P(\mathbf{x} \mid \boldsymbol{\theta}) = \prod_{k=1}^{K} \theta_k^{x_k}$$
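A small sketch (assuming NumPy) of a single Categorical trial: sample a category, one-hot encode it, and evaluate the PMF $\prod_k \theta_k^{x_k}$.

```python
import numpy as np

rng = np.random.default_rng(7)
theta = np.array([0.2, 0.5, 0.3])        # probability vector, sums to 1 (example values)

k = rng.choice(len(theta), p=theta)      # one Categorical trial -> a category index
x = np.zeros(len(theta))
x[k] = 1                                 # one-hot encoding of the outcome

pmf = np.prod(theta ** x)                # the product picks out theta_k
print(k, x, pmf)                         # pmf equals theta[k]
```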
Foundation – Sum of Multi-Class Trials
Multinomial Experiment
This is an experiment consisting of $n$ independent Categorical trials, each with the same probability vector $\boldsymbol{\theta}$.
Multinomial Distribution
A random vector $\mathbf{X} = (X_1, \dots, X_K)$ follows a Multinomial distribution if each component $X_k$ represents the total count for the $k$-th category in $n$ trials. It is governed by parameters $n$ and $\boldsymbol{\theta}$. We write $\mathbf{X} \sim \mathrm{Mult}(n, \boldsymbol{\theta})$.
Probability Mass Function (PMF): The probability of observing exactly $x_1$ outcomes of category 1, $x_2$ of category 2, …, up to $x_K$ of category $K$ (where $\sum_{k=1}^{K} x_k = n$) is given by:
$$P(X_1 = x_1, \dots, X_K = x_K) = \frac{n!}{x_1! \, x_2! \cdots x_K!} \prod_{k=1}^{K} \theta_k^{x_k}$$
The first term is the multinomial coefficient, which counts the number of ways to arrange the outcomes.
Moments:
- Expectation: $\mathbb{E}[X_k] = n\theta_k$
- Variance: $\mathrm{Var}[X_k] = n\theta_k(1 - \theta_k)$ (each marginal $X_k$ is a Binomial $\mathrm{Bin}(n, \theta_k)$)
- Covariance: $\mathrm{Cov}[X_j, X_k] = -n\theta_j\theta_k$ for $j \neq k$. The counts are negatively correlated because an increase in the count of one category must come at the expense of another.
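A quick simulation check of these moments (a sketch assuming NumPy; $n$ and $\boldsymbol{\theta}$ are example values), including the negative covariance between counts:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
theta = np.array([0.2, 0.5, 0.3])

counts = rng.multinomial(n, theta, size=100_000)   # each row is one experiment, rows sum to n

print(counts.mean(axis=0), n * theta)                    # E[X_k]   = n * theta_k
print(counts.var(axis=0), n * theta * (1 - theta))       # Var[X_k] = n * theta_k * (1 - theta_k)
print(np.cov(counts[:, 0], counts[:, 1])[0, 1],          # Cov[X_1, X_2]
      -n * theta[0] * theta[1])                          # = -n * theta_1 * theta_2
```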
Sources
- Mathematics for Machine Learning (link)