  1. What’s there?
    1. Why do I write this?
  2. Box-Muller Transform
    1. Foundational insight
    2. Theoretical development
    3. Algorithm
    4. How to do this for any arbitrary thing?
  3. Rejection Sampling
    1. Setup
    2. Algorithm
    3. Interpretation
    4. More intuition
    5. Ok, why not accept all points with r < 1?
  4. Sources

What’s there?

While the Inverse Transform Sampling method provides a theoretical foundation for random sampling, its practical applicability is limited to distributions with an invertible Cumulative Distribution Function (CDF).

So, we have other stuff!

Why do I write this?

I wanted to create an authoritative document I could refer back to later. Figured I might as well make it public (build in public, serve society, that sort of stuff).

Box-Muller Transform

We can use this to generate \mathcal{N}(0, 1) from U(0, 1).

Foundational insight

Remember that the CDF of the Gaussian distribution, which involves the error function (\text{erf}), does not have a closed-form analytical inverse.

What else did you expect? I’ll give more insights through the theory below; I don’t want to spell them out here.

So, I really just wrote this section to mention what is called the Jacobian determinant.

Just remember that

In probability, density is “probability per unit area.” When you switch coordinate systems, the definition of a “unit area” changes.

We’ll get back to this.

Theoretical development

Let X \sim \mathcal{N}(0, 1) and Y \sim \mathcal{N}(0, 1) be two independent standard normal random variables. Their joint PDF is:

f(x, y) = f(x)f(y) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2} \cdot \frac{1}{\sqrt{2\pi}}e^{-y^2/2} = \frac{1}{2\pi}e^{-(x^2+y^2)/2}

The expression e^{-(x^2+y^2)/2} shows that the probability density depends only on the squared distance from the origin (x^2+y^2), meaning the distribution is perfectly circular and symmetric.

Sound familiar?

Yes, we can use polar coordinates now! We can use x = r\cos\theta and y = r\sin\theta, so r^2 = x^2+y^2.

We know that –

  • Cartesian Area: A small rectangular area element, dx\,dy, has the same size everywhere on the plane.
  • Polar Area: A small “polar patch,” dr\,d\theta, does not have a constant area. A patch with a large radius r is much bigger than a patch with a small radius, even if dr and d\theta are the same.

The fundamental relationship is that the probability in a small patch must be equal in both systems:

f(x, y) \,dx\,dy = f(r, \theta) \,dr\,d\theta
f(r, \theta) = f(x, y) \cdot \frac{dx\,dy}{dr\,d\theta} = f(x, y) \cdot r
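
Where does that factor of r come from? Compute the determinant of the matrix of partial derivatives for x = r\cos\theta and y = r\sin\theta:

\frac{dx\,dy}{dr\,d\theta} = \det\begin{pmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{pmatrix} = r\cos^2\theta + r\sin^2\theta = r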

This is the Jacobian determinant at work. Using it, we get:

f(r, \theta) = r e^{-r^2/2} \cdot \frac{1}{2\pi}

Algorithm

We see that the joint density factors into two parts: f(r, \theta) = f_R(r) f_\Theta(\theta), where:

  • f_\Theta(\theta) = \frac{1}{2\pi} for \theta \in [0, 2\pi). This is the PDF of a uniform distribution.
  • f_R(r) = r e^{-r^2/2} for r \ge 0. This is the PDF of a Rayleigh distribution.

The Box-Muller algorithm exploits this by generating samples for R and \Theta and then transforming them back.

  1. Generate two independent uniform random variables u_1, u_2 \sim U(0, 1).
  2. Generate the angle \theta = 2\pi u_1. This is a direct sample from the uniform distribution for the angle.
  3. Generate the radius r. The squared radius, R^2, follows an exponential distribution with rate \lambda=1/2, since P(R^2 \le s) = P(R \le \sqrt{s}) = 1 - e^{-s/2}. We can use Inverse Transform Sampling to generate a sample for R^2 from u_2: r^2 = -2\ln(u_2). Thus, r = \sqrt{-2\ln(u_2)}.
  4. Convert back to Cartesian coordinates:

x = r \cos\theta = \sqrt{-2\ln u_2} \cos(2\pi u_1)

y = r \sin\theta = \sqrt{-2\ln u_2} \sin(2\pi u_1)

The resulting x and y are two independent samples from a standard normal distribution.
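
Here is a minimal sketch of the algorithm in Python (standard library only; the function name and the guard against \ln(0) are my own choices):

import math
import random

def box_muller():
    """Return two independent N(0, 1) samples built from two U(0, 1) draws."""
    u1 = random.random()
    u2 = 1.0 - random.random()  # shift [0, 1) to (0, 1] so log(u2) stays finite
    theta = 2.0 * math.pi * u1            # angle ~ U(0, 2*pi)
    r = math.sqrt(-2.0 * math.log(u2))    # radius: r^2 ~ Exponential(rate 1/2)
    # Convert back to Cartesian coordinates: two independent standard normals.
    return r * math.cos(theta), r * math.sin(theta)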

How to do this for any arbitrary thing?

To generate samples from an arbitrary Gaussian \mathcal{N}(\mu, \sigma^2), one simply generates a standard normal sample z using Box-Muller and then applies the linear transformation: x = \mu + \sigma z.

Duh.
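
In code, continuing the box_muller sketch above (mu and sigma are arbitrary example values):

mu, sigma = 5.0, 2.0   # example target: N(5, 4)
z, _ = box_muller()    # one standard normal draw (the second is discarded)
x = mu + sigma * z     # a sample from N(mu, sigma^2)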

Rejection Sampling

This is for when we want to generate samples from p(x) but cannot sample from it directly; we can, however, evaluate its density (up to a constant).

It works by sampling from an easier-to-sample “proposal” distribution and then probabilistically accepting or rejecting those samples.

Setup

  1. Target Distribution, p(x): The complex distribution we want to sample from. We only need to be able to evaluate a function \tilde{p}(x) that is proportional to p(x), i.e., p(x) = \tilde{p}(x)/Z where the normalization constant Z may be unknown.
  2. Proposal Distribution, q(x): A simpler distribution that we already know how to sample from (e.g., a uniform or Gaussian).
  3. Acceptance Constant, M: A constant chosen such that M \cdot q(x) \ge \tilde{p}(x) for all x. The function M q(x) must form an “envelope” that is everywhere above our unnormalized target density.

Algorithm

  1. Propose: Draw a sample x_0 from the proposal distribution, x_0 \sim q(x).
  2. Evaluate: Calculate the acceptance ratio: r = \frac{\tilde{p}(x_0)}{M q(x_0)}. This ratio is guaranteed to be in [0, 1].
  3. Accept/Reject: Draw a sample u from the standard uniform distribution, u \sim U(0, 1).
    • If u < r, accept the sample: x = x_0.
    • If u \ge r, reject the sample and return to Step 1.
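
Here is a minimal sketch in Python of the three steps above. The bimodal target, the uniform proposal, and the value of M are my own illustrative choices, not anything canonical:

import math
import random

def rejection_sample(p_tilde, q_sample, q_pdf, M):
    """Draw one sample from the density proportional to p_tilde."""
    while True:
        x0 = q_sample()                      # Step 1: propose x0 ~ q
        r = p_tilde(x0) / (M * q_pdf(x0))    # Step 2: acceptance ratio, in [0, 1]
        if random.random() < r:              # Step 3: accept with probability r
            return x0

# Example: an unnormalized bimodal target, truncated to [-4, 4] so the
# uniform proposal covers its support.
def p_tilde(x):
    if not -4.0 <= x <= 4.0:
        return 0.0
    return math.exp(-(x - 1.5) ** 2) + math.exp(-(x + 1.5) ** 2)

q_pdf = lambda x: 1.0 / 8.0                  # density of U(-4, 4)
q_sample = lambda: random.uniform(-4.0, 4.0)
M = 16.0                                     # p_tilde <= 2 and M * q_pdf(x) = 2, so M q(x) >= p_tilde(x)

samples = [rejection_sample(p_tilde, q_sample, q_pdf, M) for _ in range(10_000)]

The overall acceptance rate is the area under \tilde{p} divided by the area under the envelope M \cdot q, so the tighter the envelope, the fewer proposals you waste.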

Interpretation

Yeah, what is even happening here?

Imagine the graphs of \tilde{p}(x) and the envelope M q(x). The algorithm is equivalent to:

  1. Sample a point (x_0, y_0) uniformly from the area under the curve of the envelope M q(x).
  2. If this point is also under the curve of the target density \tilde{p}(x), accept the x-coordinate x_0. Otherwise, reject it.

This process correctly generates samples distributed according to p(x).

Still don’t get it? Yeah, I didn’t either.

More intuition

For the dummies out there (like me!).

Imagine you want to sample random points from within the shape of an oddly-shaped pond (your complex target distribution), but you can only throw darts randomly at a large, rectangular tarp that covers the entire pond (your simple proposal distribution).

  1. Throw a Dart: You throw a dart, and it lands somewhere on the tarp. This is Step 1: Propose.
  2. Check Where it Landed: You look at the dart’s location.
  3. Accept or Reject:
    • If the dart landed in the pond, you accept its location as a valid sample.
    • If the dart landed on the tarp but outside the pond, you reject it and throw another dart.

If you do this many times, the collection of accepted dart locations will be a perfect random sample from within the shape of the pond.

That was just the basic picture, not the actual algorithm.

In the real algorithm, the shapes have varying heights: the dartboard is the envelope M \cdot q(x), and the pond is the unnormalized target \tilde{p}(x).

  • Propose (Throw a horizontal dart): When you draw a sample x_{prop} from q(x), you are choosing a random horizontal position on the dartboard.
  • Accept/Reject (Throw a vertical dart): Now, at that horizontal position, you need to decide whether to keep the sample. This is done by throwing a “vertical” dart.
    1. The height of the dartboard at this position is M \cdot q(x_{prop}).
    2. The height of your target shape is \tilde{p}(x_{prop}).
    3. The acceptance ratio r = \frac{\tilde{p}(x_{prop})}{M \cdot q(x_{prop})} tells you how much of the dartboard’s height is “filled” by the target shape at that exact spot.
    4. You then draw a random number u between 0 and 1. If u is less than the ratio r, it means your “vertical dart” has landed underneath the curve of the target shape. You accept the sample. If not, it landed above the target but still on the dartboard, so you reject it.

This process ensures that you are more likely to accept samples in regions where the target distribution \tilde{p}(x) is high and less likely to accept them where it’s low. Over many trials, the collection of accepted samples perfectly mimics the shape of the target distribution you wanted all along.

Ok, why not accept all points with r < 1?

That is, with the acceptance ratio less than 1?

The proposal distribution, q(x), generates samples according to its own shape. For example, if q(x) is a uniform distribution, it will propose samples evenly across its domain. If we simply accepted every sample we generated, our final collection of points would just be a uniform distribution.

The random comparison, u < r, is a clever trick to “thin out” the samples from the proposal distribution so that their final density matches the target.

Think of the acceptance ratio, r = \frac{\tilde{p}(x)}{M \cdot q(x)}, as the fraction of the envelope’s height that is “filled” by the target curve at a specific point x.

  • Where the Target is High: In a region where \tilde{p}(x) is high (a peak), the ratio r will be close to 1. This means you have a very high chance of accepting a sample proposed in this region.
  • Where the Target is Low: In a region where \tilde{p}(x) is low (a valley), the ratio r will be close to 0. This means you have a very low chance of accepting a sample proposed here, and you’ll reject most of them.

This probabilistic step ensures that you keep more samples at the peaks and fewer samples in the valleys, perfectly sculpting the simple proposal distribution into the shape of your desired target distribution.

Makes sense? Yay!

Sources

  1. Mathematics for Machine Learning (link)
