- What’s there?
- Bernoulli and Binomial Distributions
- Poisson Distribution
- Categorical and Multinomial Distributions
- Sources
What’s there?
Like Foundations of Probability, this is a quick pass over a few more basics before we get to Gaussians.
Why do I write this?
I wanted to create an authoritative document I could refer to for later. Thought that I might as well make it public (build in public, serve society and that sort of stuff).
Bernoulli and Binomial Distributions
Why together?
They form the basis for modeling any experiment with a binary outcome.
Foundation – The Single Trial
Bernoulli Trial
A single experiment with exactly two possible, mutually exclusive outcomes. These outcomes are generically labeled “success” and “failure”.
Bernoulli Distribution
A random variable $X$ is said to follow a Bernoulli distribution if it is the outcome of a single Bernoulli trial.
The distribution is governed by a single parameter, $p \in [0, 1]$, which represents the probability of success.
Probability Mass Function (PMF): The PMF of a Bernoulli random variable is
$$P(X = 1) = p, \qquad P(X = 0) = 1 - p.$$
This can be written more compactly as
$$P(X = x) = p^x (1 - p)^{1 - x}, \quad x \in \{0, 1\}.$$
Moments:
- Expectation: $\mathbb{E}[X] = p$
- Variance: $\mathrm{Var}[X] = p(1 - p)$
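To make this concrete, here is a minimal sketch (assuming NumPy; the value of $p$ is just an example) that samples Bernoulli trials and checks that the sample mean and variance land near $p$ and $p(1 - p)$:

```python
import numpy as np

rng = np.random.default_rng(42)
p = 0.3                                          # probability of success (example value)

# A Bernoulli variable is just a Binomial with n = 1.
samples = rng.binomial(n=1, p=p, size=100_000)

print(samples.mean())    # ~0.30, matching E[X] = p
print(samples.var())     # ~0.21, matching Var[X] = p(1 - p)
```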
Foundation – Sum of Trials
Binomial Experiment
The Binomial distribution arises from extending the Bernoulli trial. It describes the outcome of an experiment that satisfies the following four conditions:
1. The experiment consists of a fixed number of trials, $n$.
2. Each trial is independent of the others.
3. Each trial has only two possible outcomes (success/failure).
4. The probability of success, $p$, is the same for each trial.
Binomial Distribution
A random variable $X$ is said to follow a Binomial distribution if it represents the total number of successes in $n$ independent Bernoulli trials. It is governed by two parameters: the number of trials $n$ and the probability of success $p$. We write $X \sim \mathrm{Bin}(n, p)$.
Probability Mass Function (PMF): To find the probability of observing exactly $k$ successes in $n$ trials, we need two components:
- The probability of any specific sequence of $k$ successes and $n - k$ failures is $p^k (1 - p)^{n - k}$, due to independence.
- The number of distinct ways to arrange $k$ successes among $n$ trials is given by the binomial coefficient $\binom{n}{k} = \frac{n!}{k!\,(n - k)!}$.

Combining these gives the PMF:
$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}, \quad k = 0, 1, \dots, n.$$
Moments: The Binomial random variable can be seen as the sum of $n$ independent Bernoulli random variables, $X = \sum_{i=1}^{n} X_i$, where each $X_i \sim \mathrm{Bernoulli}(p)$. Using the linearity of expectation and the property that the variance of a sum of independent variables is the sum of their variances:
- Expectation: $\mathbb{E}[X] = np$
- Variance: $\mathrm{Var}[X] = np(1 - p)$
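As a quick sanity check of the PMF and the sum-of-Bernoullis view, here is a sketch assuming NumPy and SciPy (the values of $n$, $p$, and $k$ are arbitrary examples):

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
n, p = 10, 0.3

# Build Binomial samples explicitly as sums of n independent Bernoulli trials.
bernoulli_trials = rng.binomial(1, p, size=(50_000, n))
successes = bernoulli_trials.sum(axis=1)

k = 3
print(binom.pmf(k, n, p))                 # exact P(X = 3) from the formula
print(np.mean(successes == k))            # simulated frequency, should be close
print(successes.mean(), n * p)            # E[X] = np
print(successes.var(), n * p * (1 - p))   # Var[X] = np(1 - p)
```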
Poisson Distribution
The Poisson distribution can be derived as the limit of the Binomial distribution $\mathrm{Bin}(n, p)$ as the number of trials $n$ goes to infinity while the probability of success $p$ goes to zero, in such a way that their product remains constant: $np = \lambda$.
Huh?
Imagine dividing a one-hour interval into one-second subintervals. Let the "success" be an event (e.g., a customer arriving) occurring in a subinterval. The probability $p$ of success in any one-second subinterval is very small, but the number of trials $n$ is very large. In this limit, the Binomial PMF converges to the Poisson PMF.
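Here is a small numerical illustration of that limit, a sketch assuming SciPy: fix $\lambda = np$ and let $n$ grow while $p = \lambda / n$ shrinks, and the Binomial PMF lines up with the Poisson PMF.

```python
import numpy as np
from scipy.stats import binom, poisson

lam = 4.0                  # fixed lambda = n * p (example value)
ks = np.arange(15)

for n in (10, 100, 10_000):
    p = lam / n            # p shrinks as n grows, keeping n * p constant
    max_gap = np.max(np.abs(binom.pmf(ks, n, p) - poisson.pmf(ks, lam)))
    print(n, max_gap)      # the gap shrinks toward 0 as n grows
```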
Properties
Probability Mass Function (PMF):
$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots$$
The range of a Poisson variable is countably infinite. The term $e^{-\lambda}$ is a normalization constant that ensures the probabilities sum to 1.
Moments:
- Expectation: $\mathbb{E}[X] = \lambda$
- Variance: $\mathrm{Var}[X] = \lambda$
When is this even used? Predicting counts.
What if we want to predict the count of things? That is, a target $y \in \{0, 1, 2, \dots\}$.
When the target variable in a regression problem is a count (e.g., number of purchases a customer makes in a month, number of clicks on an ad), standard linear regression is inappropriate.
Why?
A linear regression model doesn’t know the target variable is a count. It can easily predict negative values (e.g., -2 purchases) or fractional values (e.g., 4.7 clicks), which are nonsensical in this context.
Moreover, and this is important,
linear regression assumes that the variability (variance) of the data points is the same regardless of their predicted value. Huh?
What matters here is a statistical property called heteroscedasticity, where the variability of a variable is unequal across the range of values of a second variable that predicts it.
In simpler terms, the data’s “spread” or “scatter” is not constant!
Standard linear regression assumes homoscedasticity. This means it expects the errors (the difference between the actual data points and the regression line) to have the same variance at all levels of the predictor variable.
Think of it as a consistent “error band” around the regression line. The amount of scatter should be roughly the same for small predicted values as it is for large ones.
Count data is often heteroscedastic. The variance tends to increase as the mean count increases. For instance, consider the example of daily website visitors –
- Small Personal Blog: The average number of visitors might be 10 per day. The daily count probably won’t vary much, maybe staying within a tight range like 5 to 15. The variance is low.
- Popular Website: The average might be 10,000 visitors per day. On a slow day, it might get 8,000, and on a viral day, it might get 15,000. The range of possible values is much wider, so the variance is high.
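You can see this variance-grows-with-the-mean behaviour directly by simulating Poisson counts at the two traffic levels from the example above (a sketch assuming NumPy; real traffic data is usually even more over-dispersed than a Poisson, but the direction of the effect is the same):

```python
import numpy as np

rng = np.random.default_rng(1)

for mean_visitors in (10, 10_000):
    daily_counts = rng.poisson(mean_visitors, size=365)   # a year of daily counts
    print(mean_visitors, daily_counts.var())               # sample variance grows with the mean
```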
What do we do instead?
Poisson regression: the response variable is assumed to follow a Poisson distribution, and the logarithm of its mean parameter is modeled as a linear function of the features.
- This distribution is defined only for non-negative integers, so the model’s underlying assumption matches the nature of the data.
- Also, a key property of the Poisson distribution is that its mean is equal to its variance, which naturally handles the issue of non-constant variance.
How does it work?
Instead of modeling the mean count directly, Poisson regression models the logarithm of the mean as a linear combination of the features: $\log \lambda = \mathbf{w}^\top \mathbf{x}$.
Using the log ensures that the predicted mean, $\lambda = \exp(\mathbf{w}^\top \mathbf{x})$, will always be positive, thus preventing the model from making impossible negative predictions. This logarithmic transformation is called the link function.
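Here is a minimal sketch of that idea using scikit-learn's PoissonRegressor, which uses the log link by default (the synthetic feature and coefficients below are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(0)

# One synthetic feature; the true log-mean is linear in x.
X = rng.uniform(0, 2, size=(500, 1))
lam = np.exp(0.5 + 1.2 * X[:, 0])      # true mean, always positive
y = rng.poisson(lam)                   # observed counts

model = PoissonRegressor(alpha=0.0)    # no regularization, for clarity
model.fit(X, y)

print(model.intercept_, model.coef_)   # roughly 0.5 and 1.2
print(model.predict(X[:5]))            # predicted means, never negative
```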
Categorical and Multinomial Distributions
These are the direct generalizations of the Bernoulli and Binomial distributions to experiments with more than two possible outcomes.
Foundation – The Single Multi-Class Trial
Categorical Trial:
This is a single experiment with $K$ possible, mutually exclusive outcomes (or "categories").
Categorical Distribution:
A random variable $X$ follows a Categorical distribution if it is the outcome of a single such trial.
The distribution is governed by a parameter vector $\boldsymbol{\theta} = (\theta_1, \dots, \theta_K)$, where $\theta_k$ is the probability of the $k$-th outcome, and $\sum_{k=1}^{K} \theta_k = 1$.
- Representation: The outcome is typically represented using one-hot encoding. A trial resulting in category $k$ is represented by a $K$-dimensional vector $\mathbf{x} = (0, \dots, 0, 1, 0, \dots, 0)$, where the 1 is in the $k$-th position.
- Probability Mass Function (PMF): For a one-hot encoded vector $\mathbf{x}$:
$$P(\mathbf{x} \mid \boldsymbol{\theta}) = \prod_{k=1}^{K} \theta_k^{x_k}$$
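A small sketch (assuming NumPy) of a single Categorical trial: sample a category, one-hot encode it, and evaluate the PMF $\prod_k \theta_k^{x_k}$.

```python
import numpy as np

rng = np.random.default_rng(7)
theta = np.array([0.2, 0.5, 0.3])        # probability vector, sums to 1 (example values)

k = rng.choice(len(theta), p=theta)      # one Categorical trial -> a category index
x = np.zeros(len(theta))
x[k] = 1                                 # one-hot encoding of the outcome

pmf = np.prod(theta ** x)                # the product picks out theta_k
print(k, x, pmf)                         # pmf equals theta[k]
```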
Foundation – Sum of Multi-Class Trials
Multinomial Experiment
This is an experiment consisting of $n$ independent Categorical trials, each with the same probability vector $\boldsymbol{\theta}$.
Multinomial Distribution
A random vector $\mathbf{X} = (X_1, \dots, X_K)$ follows a Multinomial distribution if each component $X_k$ represents the total count for the $k$-th category in $n$ trials. It is governed by parameters $n$ and $\boldsymbol{\theta}$. We write $\mathbf{X} \sim \mathrm{Mult}(n, \boldsymbol{\theta})$.
Probability Mass Function (PMF): The probability of observing exactly $x_1$ outcomes of category 1, $x_2$ of category 2, …, up to $x_K$ of category $K$ (where $\sum_{k=1}^{K} x_k = n$) is given by:
$$P(X_1 = x_1, \dots, X_K = x_K) = \frac{n!}{x_1! \, x_2! \cdots x_K!} \prod_{k=1}^{K} \theta_k^{x_k}$$
The first term is the multinomial coefficient, which counts the number of ways to arrange the outcomes.
Moments:
- Expectation: $\mathbb{E}[X_k] = n\theta_k$
- Variance: $\mathrm{Var}[X_k] = n\theta_k(1 - \theta_k)$ (each marginal $X_k$ is a Binomial $\mathrm{Bin}(n, \theta_k)$)
- Covariance: $\mathrm{Cov}[X_j, X_k] = -n\theta_j\theta_k$ for $j \neq k$. The counts are negatively correlated because an increase in the count of one category must come at the expense of another.
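A quick simulation check of these moments (a sketch assuming NumPy; $n$ and $\boldsymbol{\theta}$ are example values), including the negative covariance between counts:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
theta = np.array([0.2, 0.5, 0.3])

counts = rng.multinomial(n, theta, size=100_000)   # each row is one experiment, rows sum to n

print(counts.mean(axis=0), n * theta)                    # E[X_k]   = n * theta_k
print(counts.var(axis=0), n * theta * (1 - theta))       # Var[X_k] = n * theta_k * (1 - theta_k)
print(np.cov(counts[:, 0], counts[:, 1])[0, 1],          # Cov[X_1, X_2]
      -n * theta[0] * theta[1])                          # = -n * theta_1 * theta_2
```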
Sources
- Mathematics for Machine Learning (link)