What’s there?
Covers these things –
- Uniform & Exponential distributions
- Beta & Dirichlet (distributions over probabilities, key for Bayesian inference)
This should conclude stuff from continuous distributions that I wanted to talk about.
Why do I write this?
I wanted to create an authoritative document I could refer to for later. Thought that I might as well make it public (build in public, serve society and that sort of stuff).
Uniform
Wdym what’s uniform? We’ve already used it so many times.
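Still, for the record (since this is supposed to be a reference document) – if $X \sim \text{Uniform}(a, b)$, the density and first two moments are the standard results:

$$f(x; a, b) = \frac{1}{b - a} \quad \text{for } a \le x \le b, \qquad E[X] = \frac{a + b}{2}, \qquad \text{Var}(X) = \frac{(b - a)^2}{12}$$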
Exponential
The Exponential distribution is a continuous distribution that models the time between events in a Poisson process.
Foundational insight
Recall that a Poisson process describes events occurring independently at a constant average rate, $\lambda$. The Poisson distribution counts the number of events in a fixed interval.
The Exponential distribution describes the waiting time for the very next event to occur.
Ok? Cool, I also want to say that it is memoryless.
The Exponential distribution is the only continuous distribution that is memoryless. This means that the probability of an event occurring in the next instant is independent of how much time has already elapsed. If you have been waiting for 10 minutes for a bus whose arrival follows an exponential distribution, the probability of it arriving in the next minute is exactly the same as it was when you first arrived.
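This falls straight out of the survival function $P(X > t) = e^{-\lambda t}$ (which follows from the PDF given in the next subsection):

$$P(X > s + t \mid X > s) = \frac{P(X > s + t)}{P(X > s)} = \frac{e^{-\lambda (s + t)}}{e^{-\lambda s}} = e^{-\lambda t} = P(X > t)$$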
Theoretical development
Formal Definition: A continuous random variable $X$ follows an Exponential distribution with rate parameter $\lambda > 0$, denoted $X \sim \text{Exp}(\lambda)$, if its PDF is:

$$f(x; \lambda) = \lambda e^{-\lambda x}, \quad x \ge 0$$
The rate parameter is the same as the rate parameter in the underlying Poisson process (events per unit of time).
Expectation (Mean): $E[X] = \frac{1}{\lambda}$. This is intuitive: if events occur at a rate of $\lambda$ per hour, the average waiting time is $\frac{1}{\lambda}$ of an hour.
And, of course, the variance – $\text{Var}(X) = \frac{1}{\lambda^2}$.
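A minimal numerical sanity check of the mean, variance, and memorylessness claims – a sketch with NumPy, where the rate $\lambda = 2$ and the times $s, t$ are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0  # rate: 2 events per unit time (arbitrary choice)
x = rng.exponential(scale=1 / lam, size=1_000_000)  # NumPy takes scale = 1/lambda

print(x.mean())  # ~0.5  = 1/lambda
print(x.var())   # ~0.25 = 1/lambda^2

# Memorylessness: P(X > s + t | X > s) should match P(X > t)
s, t = 0.7, 0.3
print((x > s + t).mean() / (x > s).mean())  # conditional survival, ~ e^(-lambda*t)
print((x > t).mean())                       # unconditional survival, same value
```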
Beta
Ok, now we start the stuff that I really wanted to write this prose for.
Introduction
Beta and Dirichlet Distributions – They are not distributions over data itself, but distributions over the parameters of other distributions. This makes them essential tools for Bayesian inference.
The Beta distribution is a continuous distribution defined on the interval $[0, 1]$. It is the canonical distribution for modeling uncertainty about a probability, such as the bias of a coin.
Foundations
We must talk about the Gamma Function.
It’s essentially the continuous form of the factorial. It is defined as –

$$\Gamma(z) = \int_0^\infty t^{z - 1} e^{-t} \, dt$$

The crucial connection between the Gamma function and the factorial is given by the identity, for any positive integer $n$ –

$$\Gamma(n) = (n - 1)!$$
Properties of the Gamma function
- Recursion Formula: Similar to the factorial, $\Gamma(z + 1) = z\,\Gamma(z)$.
- Domain and Poles: The Gamma function is defined for all complex numbers except for non-positive integers (0, -1, -2, …), where it has simple poles. This means the function goes to infinity at these points.
- Special Values: While the Gamma function can be challenging to compute directly for many values, some special values are well-known. A particularly famous one is:

$$\Gamma\left(\tfrac{1}{2}\right) = \sqrt{\pi}$$

This is important and, via the recursion formula, can be used further as

$$\Gamma\left(\tfrac{3}{2}\right) = \tfrac{1}{2}\,\Gamma\left(\tfrac{1}{2}\right) = \tfrac{\sqrt{\pi}}{2}.$$
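These identities are easy to check numerically with Python’s standard library – a quick sketch:

```python
import math

# Gamma(n) = (n - 1)! for positive integers n
for n in range(1, 6):
    print(n, math.gamma(n), math.factorial(n - 1))

# Gamma(1/2) = sqrt(pi), and via the recursion Gamma(3/2) = sqrt(pi)/2
print(math.gamma(0.5), math.sqrt(math.pi))
print(math.gamma(1.5), math.sqrt(math.pi) / 2)
```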
First principles
Consider a Bernoulli trial (e.g., a coin flip) with an unknown probability of success, $\theta$. Before we flip the coin, we are uncertain about the value of $\theta$. The Beta distribution provides a way to express this uncertainty.
Formal Definition: A random variable $\theta$ follows a Beta distribution, denoted $\theta \sim \text{Beta}(\alpha, \beta)$, if its PDF is:

$$f(\theta; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)} \, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}, \quad \theta \in [0, 1]$$

where $\alpha > 0$ and $\beta > 0$ are its hyperparameters (parameters of a distribution over a parameter). The term with the Gamma functions, $\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}$, is a normalization constant.
Interpretation of Hyperparameters: The hyperparameters $\alpha$ and $\beta$ can be intuitively understood as “pseudo-counts” of successes and failures, respectively.
- $\alpha - 1$ is the power of $\theta$ (the probability of success).
- $\beta - 1$ is the power of $1 - \theta$ (the probability of failure).
- $\alpha = \beta = 1$ corresponds to a uniform distribution, representing complete prior uncertainty.
- $\alpha > \beta$ pushes the probability mass towards 1.
- $\alpha < \beta$ pushes the probability mass towards 0.
- Large $\alpha$ and $\beta$ create a distribution sharply peaked around its mean, representing strong prior belief (see the sketch below).
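A small sketch with scipy.stats (assuming you have SciPy installed) that makes the shape claims above concrete by evaluating the Beta PDF at a few points:

```python
from scipy.stats import beta

thetas = [0.1, 0.5, 0.9]  # points on [0, 1] to probe

# (alpha, beta) pairs matching the bullets above
for a, b in [(1, 1), (5, 2), (2, 5), (50, 50)]:
    densities = [round(beta.pdf(t, a, b), 3) for t in thetas]
    print(f"Beta({a},{b}) pdf at {thetas}: {densities}")
# (1,1): flat; (5,2): mass near 1; (2,5): mass near 0; (50,50): sharp peak at 0.5
```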
Conjugacy
Conjugacy: In Bayesian inference, a prior distribution is conjugate to a likelihood if the resulting posterior distribution belongs to the same family of distributions as the prior.
What? It means that if you choose a specific type of prior distribution (your initial belief) that “matches” your likelihood (the data you observe), your posterior distribution (your updated belief) will be the same type of distribution as your prior, just with updated parameters.
This gets us to the Beta-Bernoulli Model –
- Prior: We place a Beta prior on the unknown probability $\theta$: $\theta \sim \text{Beta}(\alpha, \beta)$.
- Likelihood: We observe data from a Bernoulli or Binomial process. Suppose we observe $s$ successes and $f$ failures. The likelihood is proportional to $\theta^{s} (1 - \theta)^{f}$.
- Posterior: According to Bayes’ theorem, the posterior is proportional to the likelihood times the prior:

$$p(\theta \mid \text{data}) \propto \theta^{s} (1 - \theta)^{f} \cdot \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} = \theta^{\alpha + s - 1} (1 - \theta)^{\beta + f - 1}$$

This is the kernel of another Beta distribution. The posterior is:

$$\theta \mid \text{data} \sim \text{Beta}(\alpha + s, \beta + f)$$
Interpretation: The learning process is incredibly simple: we just add the observed counts of successes and failures to our prior pseudo-counts.
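In code, the whole update is one line of arithmetic – a sketch where the Beta(2, 2) prior and the coin flips are made up for illustration:

```python
from scipy.stats import beta

a, b = 2, 2                    # prior pseudo-counts: Beta(2, 2)
flips = [1, 1, 0, 1, 0, 1, 1]  # made-up flips: 1 = success, 0 = failure

s = sum(flips)                 # observed successes
f = len(flips) - s             # observed failures
posterior = beta(a + s, b + f) # posterior: Beta(alpha + s, beta + f)

print(posterior.mean())          # (a + s) / (a + b + s + f)
print(posterior.interval(0.95))  # 95% credible interval for theta
```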
Dirichlet
The Dirichlet distribution is the multivariate generalization of the Beta distribution. It is a distribution over a probability vector.
First principles
- The Problem: Consider a Categorical or Multinomial trial with $K$ possible outcomes and an unknown probability vector $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_K)$, where $\sum_{k=1}^{K} \theta_k = 1$. The Dirichlet distribution models our uncertainty about this entire probability vector.
- The Simplex: The domain of the Dirichlet distribution is the standard $(K - 1)$-simplex, which is the set of $K$-dimensional vectors whose components are non-negative and sum to 1.
- Formal Definition: A random vector $\boldsymbol{\theta}$ follows a Dirichlet distribution, denoted $\boldsymbol{\theta} \sim \text{Dir}(\boldsymbol{\alpha})$, if its PDF is:

$$f(\boldsymbol{\theta}; \boldsymbol{\alpha}) = \frac{\Gamma\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}$$

where the hyperparameter $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_K)$ is a vector of positive real numbers, which can be interpreted as pseudo-counts for each of the $K$ categories (see the sampling sketch after this list).
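To get a feel for the simplex constraint, here is a quick sampling sketch with NumPy (the $\boldsymbol{\alpha}$ is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = [2.0, 5.0, 3.0]                       # pseudo-counts for K = 3 categories
samples = rng.dirichlet(alpha, size=100_000)  # each row is one probability vector

print(samples[:3])                # a few draws from the simplex
print(samples.sum(axis=1).max())  # every row sums to 1 (up to float error)
print(samples.mean(axis=0))       # ~ alpha / sum(alpha) = [0.2, 0.5, 0.3]
```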
Conjugacy, the Dirichlet-Multinomial Model
- Prior: $\boldsymbol{\theta} \sim \text{Dir}(\alpha_1, \ldots, \alpha_K)$.
- Likelihood: We observe data from $n$ trials, with counts $c_1, \ldots, c_K$ for each category. The likelihood is proportional to $\prod_{k=1}^{K} \theta_k^{c_k}$.
- Posterior: The posterior is another Dirichlet distribution:

$$\boldsymbol{\theta} \mid \text{data} \sim \text{Dir}(\alpha_1 + c_1, \ldots, \alpha_K + c_K)$$
As with the Beta, learning is a simple matter of adding the observed counts to the prior pseudo-counts.
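And the update itself is the same one-liner as in the Beta case, just vectorized – a sketch with made-up counts:

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])  # uniform Dirichlet prior over K = 3 categories
counts = np.array([12, 7, 31])     # made-up counts from n = 50 multinomial trials

alpha_post = alpha + counts        # posterior: Dir(alpha_k + c_k for each k)
print(alpha_post)                  # [13.  8. 32.]
print(alpha_post / alpha_post.sum())  # posterior mean of theta
```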
Sources
- Mathematics for Machine Learning (link)