1. What’s there?
      1. Why do I write this?
    2. Errors
      1. Structure of a hypothesis test
      2. The errors themselves
    3. p-values
      1. First principles
      2. Theory
      3. (Mis)interpretations
      4. With an example
    4. Confidence intervals
      1. Fundamentals
      2. Theory
      3. Example
      4. (Mis)interpretations
    5. Sources

    What’s there?

    I wanted to note down Type I and II errors, p-values and confidence intervals.

    The central challenge in science and data analysis is to distinguish a real effect from random chance.

    I wish to understand this.

    Why do I write this?

    I wanted to create an authoritative document I could refer to for later. Thought that I might as well make it public (build in public, serve society and that sort of stuff).

    Errors

    Structure of a hypothesis test

    1. The Null Hypothesis (H_0): This is the default assumption, the hypothesis of “no effect” or “no difference.” It represents the status quo. For example, “H_0: The new drug has no effect on blood pressure.” The entire framework is designed to assess the strength of evidence against this hypothesis.
    2. The Alternative Hypothesis (H_A or H_1): This is the claim we wish to investigate. It represents a real effect or a true difference. For example, “H_A: The new drug lowers blood pressure.”

    Based on the data we collect, we must make a decision: either reject the null hypothesis (in favor of the alternative) or fail to reject the null hypothesis.

    The errors themselves

    Given the true state of the world (which is unknown to us) and the decision we make, there are four possible outcomes, two of which are errors:

    |  | Decision: Fail to Reject H_0 | Decision: Reject H_0 |
    | --- | --- | --- |
    | Truth: H_0 is true | Correct Decision | Type I Error (False Positive) |
    | Truth: H_A is true | Type II Error (False Negative) | Correct Decision (True Positive) |
    • Type I Error (False Positive):
      • Definition: A Type I error occurs when we reject the null hypothesis when it is actually true.
      • Probability: The probability of committing a Type I error is denoted by \alpha. This value is called the significance level of the test.
        • We control for this error by choosing a small value for \alpha before conducting the experiment (commonly \alpha = 0.05 or \alpha = 0.01). This means we are willing to accept a 5% or 1% chance of making a false positive conclusion.
    • Type II Error (False Negative):
      • Definition: A Type II error occurs when we fail to reject the null hypothesis when it is actually false.
      • Probability: The probability of committing a Type II error is denoted by \beta. The value 1-\beta is called the statistical power of the test. It is the probability of correctly detecting a real effect when it exists.
        • The value of \beta is not fixed in advance but depends on the true size of the effect, the sample size, and the chosen significance level \alpha.
        • There is an inherent tradeoff between \alpha and \beta; decreasing the chance of a Type I error generally increases the chance of a Type II error. This makes sense intuitively, right?
          • a lower \alpha (a stricter test) makes it harder to reject the null hypothesis, thus increasing \beta. The short simulation below illustrates what \alpha means in practice.
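
    To make \alpha concrete, here is a small simulation sketch (my own illustration, not from the source): we generate data under a true null hypothesis many times, run a one-sample t-test at \alpha = 0.05 each time, and check that roughly 5% of runs reject H_0 purely by chance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_experiments = 10_000
false_positives = 0

for _ in range(n_experiments):
    # H_0 is true by construction: the data really come from a N(0, 1) population,
    # and we test H_0: "the population mean is 0".
    sample = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p_value = stats.ttest_1samp(sample, popmean=0.0)
    if p_value <= alpha:
        false_positives += 1

# Prints something close to 0.05, the Type I error rate we chose.
print(false_positives / n_experiments)
```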

    p-values

    First principles

    • The Core Question: “If the null hypothesis were true, how surprising is my data?”
    • The Test Statistic: We first compute a test statistic from our data (e.g., a t-statistic, a chi-squared statistic). This is a single number that summarizes the deviation of our data from what would be expected under the null hypothesis.
    • The Null Distribution: We must know the probability distribution that this test statistic would follow if the null hypothesis were true. This is the null distribution.

    Theory

    Formal Definition (p-value): The p-value is the probability, assuming the null hypothesis (H_0) is true, of observing a test statistic that is at least as extreme as the one actually observed.

    p = P(\text{Test Statistic} \ge \text{Observed Statistic} | H_0 \text{ is true})

    (Mis)interpretations

    • What it IS: The p-value is a measure of surprise. A small p-value means that our observed data is very surprising if the null hypothesis is true. This leads us to doubt the null hypothesis.
    • What it is NOT: The p-value is NOT the probability that the null hypothesis is true. This is the most common and critical misinterpretation. p \neq P(H_0 | \text{Data}). Calculating P(H_0 | \text{Data}) requires a Bayesian approach involving a prior probability for H_0.

    The Decision Rule: To make a decision, we compare the p-value to our pre-specified significance level \alpha.

    • If p \le \alpha: We reject the null hypothesis. The result is deemed “statistically significant.” We conclude that there is strong evidence for the alternative hypothesis.
    • If p > \alpha: We fail to reject the null hypothesis. The result is “not statistically significant.” We conclude that we do not have sufficient evidence to discard the null hypothesis (this is NOT the same as proving the null hypothesis is true).

    With an example

    You are a botanist (yay!). You have developed a new fertilizer. Does it make plants actually grow taller?

    Your core question is: “If this fertilizer has no effect at all, how surprising is it that my fertilized plants grew so much taller than the unfertilized ones?”

    To answer this, you set up an experiment:

    • You take 60 seedlings.
    • 30 seedlings get the new fertilizer (the treatment group).
    • 30 seedlings get only water (the control group).

    After a month, you measure all the plants. You find that the fertilized plants are, on average, 2 cm taller. To standardize this result, you calculate a t-statistic. This single number boils down the difference in means, the variation in the data, and the sample size. Let’s say you calculate a t-statistic of 2.5.

    Now, you need to know if t = 2.5 is a big deal. You turn to theory. Statisticians know that if the fertilizer had no effect (the null hypothesis is true), and you repeated this experiment many times, the t-statistics you’d get would form a specific bell-shaped curve called a t-distribution, centered at 0. This is your null distribution.

    You check your null distribution to see how your result (t = 2.5) fits in. The p-value is the probability of getting a t-statistic that is at least as extreme as 2.5, assuming the fertilizer is useless. “Extreme” here means 2.5 or greater, OR -2.5 or less.

    Let’s say the area under the tails of the curve for these extreme values is 0.012. So, your p-value = 0.012.

    The p-value of 0.012 is a measure of surprise. It means that if the fertilizer had no effect, there would only be a 1.2% chance of seeing a height difference as large as the one you observed just by random luck. Because this chance is so small, you should be very surprised. This surprise leads you to strongly doubt your initial assumption that the fertilizer is useless.

    This is the critical part. The p-value of 0.012 does NOT mean there’s a 1.2% chance that the null hypothesis is true (i.e., a 1.2% chance the fertilizer is useless). It’s a statement about your data’s rarity under the null hypothesis, not a statement about the hypothesis itself.

    Before the experiment, you set a significance level, your threshold for surprise. Let’s use the standard \alpha = 0.05.

    • Compare: Your p-value (0.012) is smaller than your alpha (0.05).
    • Decision: Since p < \alpha, you reject the null hypothesis.
    • Conclusion: The result is “statistically significant.” You have strong evidence to conclude that your new fertilizer does, in fact, make plants grow taller.

    What if your p-value was 0.34?
    In that case, since 0.34 > 0.05, you would fail to reject the null hypothesis. This doesn’t prove the fertilizer is useless. It just means this particular experiment didn’t provide strong enough evidence to convince you that it works.
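
    If you want to reproduce this kind of calculation, here is a minimal sketch using SciPy (my own illustration; the exact p-value depends on which two-sample t-test you run and its degrees of freedom, so the number is in the same ballpark as, not identical to, the illustrative 0.012 above).

```python
from scipy import stats

t_stat = 2.5          # the observed t-statistic from the experiment
df = 30 + 30 - 2      # degrees of freedom for a pooled two-sample t-test

# Two-sided p-value: probability of a t-statistic at least as extreme as +/- 2.5
# under the null distribution (a t-distribution centred at 0).
p_value = 2 * stats.t.sf(abs(t_stat), df)
print(p_value)        # roughly 0.015 under these assumptions

alpha = 0.05
print("reject H0" if p_value <= alpha else "fail to reject H0")
```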

    Confidence intervals

    Fundamentals

    • The Problem: A point estimate (like the sample mean \bar{x}) is our single best guess for an unknown population parameter (the true mean \mu). However, we know this estimate is almost certainly not exactly correct due to sampling variability. A confidence interval provides a range of plausible values for the true parameter.
    • The Frequentist Perspective: This is a subtle but crucial point. From a frequentist perspective, the true population parameter \mu is a fixed, unknown constant. It is not a random variable. The confidence interval, however, is random. Its endpoints are calculated from the random sample, so they would be different if we were to repeat the experiment.

    Theory

    Formal Definition (Confidence Interval): A 100(1-\alpha)\% confidence interval for a parameter \theta is an interval [L, U], computed from the sample, such that if we were to repeat the sampling process an infinite number of times, 100(1-\alpha)\% of the thus-constructed intervals would contain the true, fixed parameter \theta.

    P(L \le \theta \le U) = 1-\alpha

    This probability statement is about the procedure of constructing the interval, not about the specific interval we have calculated.

    Example

    Construction (Example for a Mean): A confidence interval for a population mean \mu is typically constructed as:

    \text{Point Estimate} \pm \text{Margin of Error}

    \bar{x} \pm t^* \cdot \frac{s}{\sqrt{n}}

    where:

    • \bar{x} is the sample mean.
    • s is the sample standard deviation.
    • n is the sample size.
    • t^* is the critical value from the Student’s t-distribution with n-1 degrees of freedom, chosen such that the area between -t^* and +t^* is 1-\alpha.
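
    As a small numerical sketch of this construction (my own made-up sample, not from the source):

```python
import numpy as np
from scipy import stats

# Hypothetical sample (e.g., plant heights in cm)
x = np.array([10.1, 11.3, 10.8, 11.9, 10.5, 11.1, 10.9, 11.6, 10.4, 11.2])
n = len(x)
alpha = 0.05                                   # for a 95% confidence interval

x_bar = x.mean()
s = x.std(ddof=1)                              # sample standard deviation (n - 1 in the denominator)
t_star = stats.t.ppf(1 - alpha / 2, df=n - 1)  # critical value of the t-distribution

margin = t_star * s / np.sqrt(n)
print(f"{100 * (1 - alpha):.0f}% CI: [{x_bar - margin:.2f}, {x_bar + margin:.2f}]")
```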

    (Mis)interpretations

    • What it IS: If we calculate a 95% confidence interval for the mean to be [10.2, 11.8], the correct interpretation is: “We are 95% confident that the procedure we used to generate this interval captures the true population mean.” It is a statement about the reliability of our method.
    • What it is NOT: It is NOT correct to say: “There is a 95% probability that the true mean \mu lies within the interval [10.2, 11.8].” This is incorrect because in the frequentist paradigm, the true mean \mu is a fixed constant, not a random variable; it is either in the interval or it is not. A probability statement about a fixed constant is either 0 or 1.
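
    The "statement about the procedure" interpretation becomes very concrete in simulation. In this sketch (my own illustration, with an assumed true mean of 10), we repeatedly draw samples from a population whose mean we know, build a 95% interval each time, and count how often the intervals capture the true mean; the answer hovers around 95%, even though any single interval either contains it or does not.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_mu, sigma, n, alpha = 10.0, 2.0, 25, 0.05
n_repeats = 10_000
hits = 0

for _ in range(n_repeats):
    x = rng.normal(true_mu, sigma, size=n)
    x_bar, s = x.mean(), x.std(ddof=1)
    t_star = stats.t.ppf(1 - alpha / 2, df=n - 1)
    margin = t_star * s / np.sqrt(n)
    if x_bar - margin <= true_mu <= x_bar + margin:
        hits += 1

print(hits / n_repeats)   # close to 0.95: the procedure's long-run capture rate
```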

    Sources

    1. Mathematics for Machine Learning (link)
    1. What’s there?
      1. Why do I write this?
    2. Entropy
      1. Brain dump
      2. Theoretical development
    3. Cross-Entropy
      1. Theory
      2. What does this mean?
      3. Relationship with KL Divergence
    4. Kullback-Leibler (KL) Divergence
      1. Theory
      2. With cross entropy
      3. Minimizing and maximising
    5. Mutual Information
      1. Understanding
      2. Theory
      3. In terms of KL Divergence
    6. Sources

    What’s there?

    Basic principles of information theory – entropy, KL divergence.

    Why do I write this?

    I wanted to create an authoritative document I could refer to for later. Thought that I might as well make it public (build in public, serve society and that sort of stuff).

    Entropy

    Brain dump

    The Concept of Information (or “Surprisal”): The foundational idea is that learning the outcome of an event is more “informative” if that event was unlikely.

    Axioms for an Information Measure: Shannon proposed that a measure of information, I(p), associated with an event of probability p, should have the following properties:

    1. Information is non-negative: I(p) \ge 0.
    2. If an event is certain (p=1), its information content is zero: I(1) = 0.
    3. If an event is less likely, it is more informative: if p_1 < p_2, then I(p_1) > I(p_2).
    4. The information gained from observing two independent events is the sum of their individual information contents: I(p_1 p_2) = I(p_1) + I(p_2).

    The Information Formula: The unique function that satisfies these axioms is the negative logarithm:

    I(p) = -\log_b(p)

    The base of the logarithm, b, determines the units of information. If b=2, the unit is bits. If b=e (natural log), the unit is nats. In machine learning, we typically use the natural logarithm. An event with probability 1/2 has -\log_2(1/2) = 1 bit of information.

    Theoretical development

    • Formal Definition (Entropy): Let X be a discrete random variable that can take on K possible states {x_1, \dots, x_K} with probabilities P(X=x_k) = p_k. The Shannon entropy of the random variable X, denoted H(X) or H(P), is:

    H(X) = \mathbb{E}[I(P(X))] = \sum_{k=1}^K p_k I(p_k) = -\sum_{k=1}^K p_k \log p_k

    By convention, 0 \log 0 = 0.

    • Properties of Entropy:
      • Minimum Entropy: The entropy is minimal (H(X)=0) when there is no uncertainty. This occurs when one outcome is certain (p_k=1 for some k) and all others are impossible (p_j=0 for j \neq k).
      • Maximum Entropy: The entropy is maximal when there is the most uncertainty. For a variable with K states, this occurs when all outcomes are equally likely, i.e., the distribution is uniform (p_k = 1/K for all k). In this case, H(X) = \log K.
      • Geometric Interpretation: Entropy measures the “flatness” or “peakedness” of a probability distribution. A peaked distribution has low entropy; a flat distribution has high entropy.
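
    A minimal numerical sketch of these properties (my own illustration): the entropy of a one-hot, a peaked, and a uniform distribution over K = 4 states.

```python
import numpy as np

def entropy(p, base=2):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # convention: 0 * log 0 = 0
    return -np.sum(p * np.log(p)) / np.log(base)

print(entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0 bits: no uncertainty
print(entropy([0.7, 0.1, 0.1, 0.1]))      # ~1.36 bits: peaked, low entropy
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits = log2(4): uniform, maximal
```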

    Cross-Entropy

    Theory

    • Formal Definition (Cross-Entropy): Let P be the true probability distribution over a set of events, and Q be our model’s predicted probability distribution over the same set of events. The cross-entropy of Q with respect to P is:

    H(P, Q) = \mathbb{E}_{X \sim P}[I(Q(X))] = \sum_{k=1}^K p_k I(q_k) = -\sum_{k=1}^K p_k \log q_k

    Notice the structure: we are averaging the information content of the model’s probabilities (-\log q_k) over the true probabilities (p_k).

    What does this mean?

    Think about encoding the letters of the English alphabet into Morse code.

    • True Distribution (P): In real English text, the letter ‘E’ is extremely common, while ‘Z’ is very rare. An efficient code would give ‘E’ a very short representation (like a single dot) and ‘Z’ a much longer one. This is the optimal encoding scheme based on the true frequency of letters.
    • Model Distribution (Q): Now, suppose you mistakenly believe that ‘Z’ is the most common letter and ‘E’ is the rarest. You design a new Morse code based on this incorrect assumption. You’d give ‘Z’ a short code and ‘E’ a very long, inefficient one.

    Cross-entropy is the average length of the messages you’d have to send using your inefficient, ‘Z’-focused code to communicate actual English text. Because your code is optimized for the wrong reality, your messages will be, on average, much longer than necessary.

    The extra length is the “penalty” for your incorrect assumption.

    Deconstructing the formula –

    The cross-entropy of Q relative to P is calculated as:

    H(P, Q) = -\sum_{x} P(x) \log Q(x)

    Let’s break this down:

    • P(x): The true probability of an event x. This acts as a weight. We care more about events that actually happen often.
    • \log Q(x): This relates to the “cost” of encoding event x using our model’s assumed probability. If our model says an event is very likely (Q(x) is high), the cost is low. If it says an event is very unlikely (Q(x) is low), the cost is very high.
    • -\sum_{x}: We sum this “weighted cost” over all possible events to get the average cost, or the cross-entropy.

    Relationship with KL Divergence

    Cross-entropy is intimately related to Shannon entropy and the Kullback-Leibler (KL) Divergence, which measures the “extra” bits/nats needed due to using the wrong code.

    \begin{aligned} H(P, Q) &= -\sum_k p_k \log q_k \\ &= -\sum_k p_k \log \left( \frac{q_k}{p_k} p_k \right) \\ &= -\sum_k p_k \log p_k - \sum_k p_k \log \frac{q_k}{p_k} \\ &= H(P) + D_{KL}(P || Q) \end{aligned}

    Cross-Entropy = Entropy + KL Divergence.

    This is like super important, since

    In machine learning, the true data distribution P is fixed (but unknown). We are trying to find a model Q that is as close to P as possible. Since H(P) is a constant with respect to our model, minimizing the cross-entropy H(P, Q) is equivalent to minimizing the KL divergence D_{KL}(P || Q). Therefore, cross-entropy serves as an excellent loss function.
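
    Here is a small numerical check of the decomposition (my own sketch): for a toy true distribution P and model Q, the cross-entropy equals the entropy of P plus the KL divergence, and the "penalty" term is never negative.

```python
import numpy as np

P = np.array([0.6, 0.3, 0.1])   # "true" distribution
Q = np.array([0.2, 0.3, 0.5])   # a (bad) model distribution

H_P  = -np.sum(P * np.log(P))           # entropy of P (in nats)
H_PQ = -np.sum(P * np.log(Q))           # cross-entropy of Q with respect to P
D_KL = np.sum(P * np.log(P / Q))        # KL divergence D_KL(P || Q)

print(np.isclose(H_PQ, H_P + D_KL))     # True: H(P, Q) = H(P) + D_KL(P || Q)
print(H_PQ > H_P)                       # True: the extra cost D_KL is non-negative
```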

    Kullback-Leibler (KL) Divergence

    Theory

    Let P(x) and Q(x) be two probability distributions over the same random variable x. The KL Divergence of Q from P, denoted D_{KL}(P || Q), is defined as the expectation of the logarithmic difference between the two distributions, where the expectation is taken with respect to the true distribution P.

    • For discrete distributions:

    D_{KL}(P || Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}

    • For continuous distributions:

    D_{KL}(P || Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} dx

    With cross entropy

    As shown earlier, D_{KL}(P || Q) = H(P, Q) - H(P).

    The KL divergence (the penalty) is the difference between the average message length using the wrong code (H(P,Q)) and the average message length using the optimal code (H(P)).

    Minimizing and maximising

    • Minimizing D_{KL}(P || Q): This is the “forward KL” used in Maximum Likelihood Estimation. It forces Q to be non-zero wherever P is non-zero. This leads to “mode-covering” or “zero-avoiding” behavior, where the model Q tries to be broad enough to capture all the probability mass of the true distribution P.
    • Minimizing D_{KL}(Q || P): This is the “reverse KL” used in Variational Inference. It forces Q to be zero wherever P is zero. This leads to “mode-seeking” or “zero-forcing” behavior, where the approximation Q tends to find and fit one of the modes of a multi-modal true distribution P.

    Um, what?

    Mode Covering

    This version asks, “What’s the penalty for my model Q failing to account for something that is actually true (P)?”

    • The Penalty: Look at the term P(x) \log\left(\frac{P(x)}{Q(x)}\right). If there is a place where the true probability is high (P(x) > 0) but your model assigns it a near-zero probability (Q(x) \to 0), the ratio \frac{P(x)}{Q(x)} becomes huge. The logarithm of this huge number is also huge, resulting in an infinite KL divergence.
    • Behavior (“Zero-Avoiding”): To avoid this infinite penalty, the model is forced to place some probability mass everywhere the true distribution has mass. It learns to avoid assigning zero probability where it shouldn’t.
    • Result (“Mode-Covering”): If the true distribution P has multiple peaks (is multi-modal), the model Q must stretch itself out to cover all of them to avoid the infinite penalty. This results in a broad, “averaged” approximation that covers all modes, even if it doesn’t perfectly capture any single one. This is why it’s used in Maximum Likelihood Estimation, where the goal is to find a single set of parameters that best explains all the observed data.

    Mode Seeking

    This version flips the question and asks, “What’s the penalty for my model Q hallucinating something that is actually impossible (P)?”

    • The Penalty: The formula for reverse KL is D_{KL}(Q || P) = \sum_x Q(x) \log\left(\frac{Q(x)}{P(x)}\right). Now, consider a place where your model assigns some probability (Q(x) > 0), but the true probability is zero (P(x) = 0). The ratio \frac{Q(x)}{P(x)} becomes infinite, again leading to an infinite KL divergence.
    • Behavior (“Zero-Forcing”): To avoid this infinite penalty, the model is forced to have zero probability wherever the true distribution has zero probability. It learns to only place probability mass in regions where it’s sure the true distribution exists.
    • Result (“Mode-Seeking”): If the true distribution P has multiple peaks separated by regions of zero probability, the model Q cannot stretch to cover them all because that would force it to place probability in the zero-regions, resulting in an infinite penalty. Instead, the safest strategy is to pick one single mode of P and fit it as well as possible. This is why it’s used in Variational Inference, where we often want a simpler, tractable approximation (Q) that captures the most significant part of a complex true posterior distribution (P).
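
    To make the mode-covering versus mode-seeking distinction concrete, here is a small numerical sketch (my own illustration on a discretized 1-D example): P is a bimodal mixture, Q_broad is a wide Gaussian covering both modes, and Q_narrow is a Gaussian sitting on one mode. Forward KL prefers the broad approximation; reverse KL prefers the single-mode one.

```python
import numpy as np
from scipy import stats

x = np.linspace(-10, 10, 2001)

def normalize(p):
    return p / p.sum()

def kl(p, q):
    # Discrete KL divergence; tiny epsilon guards against log(0).
    eps = 1e-300
    return np.sum(p * np.log((p + eps) / (q + eps)))

# True distribution: a bimodal mixture of two Gaussians.
P = normalize(0.5 * stats.norm.pdf(x, -3, 1) + 0.5 * stats.norm.pdf(x, 3, 1))
Q_broad  = normalize(stats.norm.pdf(x, 0, 3))   # covers both modes
Q_narrow = normalize(stats.norm.pdf(x, 3, 1))   # fits one mode only

print("forward KL D(P||Q):", kl(P, Q_broad), kl(P, Q_narrow))  # broad wins (mode-covering)
print("reverse KL D(Q||P):", kl(Q_broad, P), kl(Q_narrow, P))  # narrow wins (mode-seeking)
```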

    Mutual Information

    Understanding

    How much does knowing the value of variable Y reduce my uncertainty about variable X?

    Recall that entropy, H(X), measures our total uncertainty about X. After we observe the value of Y, our uncertainty about X is now given by the conditional entropy, H(X|Y). Mutual Information is simply the reduction in uncertainty.

    \text{Mutual Information} = (\text{Initial Uncertainty about X}) - (\text{Remaining Uncertainty about X after seeing Y})

    Theory

    The Mutual Information between two random variables X and Y, denoted I(X; Y), is defined as:

    I(X; Y) = H(X) - H(X|Y)

    where H(X) is the entropy of X and H(X|Y) = -\sum_{x,y} p(x,y)\log p(x|y) is the conditional entropy.

    In terms of KL Divergence

    I(X; Y) = D_{KL}( p(x, y) || p(x)p(y) )

    • p(x, y) is the true joint distribution of the two variables, capturing all dependencies between them.
    • p(x)p(y) is the hypothetical joint distribution that we would have if the two variables were perfectly independent.
    • Therefore, the Mutual Information is the KL divergence from the independent case to the true joint case. It measures how much the true joint distribution deviates from the assumption of independence. If the variables are truly independent, then p(x,y)=p(x)p(y), and the KL divergence is zero.
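
    A minimal sketch of this KL form on a toy 2x2 joint distribution (my own example numbers):

```python
import numpy as np

# Joint distribution p(x, y) for two binary variables (rows: x, columns: y).
p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])

p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x)
p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y)

# I(X; Y) = D_KL( p(x, y) || p(x) p(y) )
mi = np.sum(p_xy * np.log(p_xy / (p_x * p_y)))
print(mi)   # > 0 here; it would be exactly 0 if X and Y were independent
```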

    Sources

    1. Mathematics for Machine Learning (link)

    1. What’s there?
      1. Why do I write this?
    2. Maximum Likelihood Estimation
      1. Foundational principle
      2. How is it usually done?
      3. Overfitting
    3. Maximum a Posteriori (MAP) Estimation
      1. Foundational principles
      2. Connection to regularization
      3. Is it perfect?
    4. Full Bayesian Estimation
      1. Basics
      2. Theoretical development
    5. Summary
    6. Sources

    What’s there?

    As an introduction to ML, I need to understand estimation theory – MLE, MAP and Bayesian Estimation.

    Why do I write this?

    I wanted to create an authoritative document I could refer to for later. Thought that I might as well make it public (build in public, serve society and that sort of stuff).

    Maximum Likelihood Estimation

    We have covered this in brief earlier as well in PCA, Spectral Clustering and t-SNE.

    Foundational principle

    This answers

    “What set of parameter values makes the observed data most probable?”

    Likelihood Function: We begin with a parameterized probability model p(\mathcal{D} | \boldsymbol{\theta}), where \mathcal{D} = \{\mathbf{x}_n\}_{n=1}^N is the observed dataset and \boldsymbol{\theta} is the vector of model parameters. The likelihood function, L(\boldsymbol{\theta}; \mathcal{D}), is defined as this probability, but viewed as a function of the parameters \boldsymbol{\theta}, holding the data \mathcal{D} fixed:

    L(\boldsymbol{\theta}; \mathcal{D}) = p(\mathcal{D} | \boldsymbol{\theta})

    The principle of maximum likelihood states that the optimal estimate for the parameters, \hat{\boldsymbol{\theta}}_{\text{ML}}, is the value that maximizes the likelihood function.

    \hat{\boldsymbol{\theta}}_{\text{ML}} = \arg\max_{\boldsymbol{\theta}} L(\boldsymbol{\theta}; \mathcal{D})

    How is it usually done?

    For computational stability and mathematical convenience, we almost always work with the log-likelihood. Since the logarithm is a strictly monotonically increasing function, maximizing the likelihood is equivalent to maximizing the log-likelihood. For an i.i.d. dataset, the likelihood is a product, which the logarithm turns into a sum:

    \hat{\boldsymbol{\theta}}_{\text{ML}} = \arg\max_{\boldsymbol{\theta}} \sum_{n=1}^N \log p(\mathbf{x}_n | \boldsymbol{\theta})

    Overfitting

    MLE’s primary drawback is its tendency to overfit, especially with small datasets. It only considers the data it has seen, with no mechanism for regularization or expressing prior beliefs.

    So, say we flipped a coin three times and got only heads; this is our dataset. Then the MLE estimate of the probability of heads is \hat{p} = 1, and the fitted model will always predict heads.
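
    A small sketch of what maximizing the log-likelihood actually does for this coin example (my own illustration, using a numerical optimizer rather than the closed form \hat{p} = s/n):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_likelihood(p, data):
    # Bernoulli log-likelihood: log p for each head, log(1 - p) for each tail.
    return -np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))

def mle(data):
    res = minimize_scalar(neg_log_likelihood, args=(data,),
                          bounds=(1e-6, 1 - 1e-6), method="bounded")
    return res.x

print(mle(np.array([1, 1, 1])))              # ~1.0: three heads -> "always heads"
print(mle(np.array([1, 0, 1, 1, 0, 1, 0])))  # ~4/7: matches the sample frequency
```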

    Maximum a Posteriori (MAP) Estimation

    MAP estimation extends MLE by incorporating prior beliefs about the parameters.

    Foundational principles

    • The Bayesian Framework: Unlike the frequentist view, the Bayesian perspective treats the model parameters \boldsymbol{\theta} not as fixed constants, but as random variables about which we can have beliefs.
    • The Prior Distribution (p(\boldsymbol{\theta})): We encode our initial beliefs about the parameters before seeing any data in a prior probability distribution, p(\boldsymbol{\theta}). This allows us to specify which parameter values are more or less plausible.
      • For instance, a Gaussian prior centered at zero, \boldsymbol{\theta} \sim \mathcal{N}(\mathbf{0}, \alpha^2\mathbf{I}), expresses a belief that smaller parameter values are more likely.
    • The Posterior Distribution (p(\boldsymbol{\theta}|\mathcal{D})): After observing data \mathcal{D}, we update our beliefs using Bayes’ theorem to obtain the posterior distribution:

    p(\boldsymbol{\theta} | \mathcal{D}) = \frac{p(\mathcal{D} | \boldsymbol{\theta}) \, p(\boldsymbol{\theta})}{p(\mathcal{D})} \propto \text{Likelihood} \times \text{Prior}

    • The MAP Principle: MAP estimation does not use the full posterior distribution. Instead, it seeks the single “best” point estimate for \boldsymbol{\theta} by finding the mode of the posterior distribution. This is the point of highest probability density.

    \hat{\boldsymbol{\theta}}_{\text{MAP}} = \arg\max_{\boldsymbol{\theta}} p(\boldsymbol{\theta} | \mathcal{D}) = \arg\max_{\boldsymbol{\theta}} \left[ p(\mathcal{D} | \boldsymbol{\theta}) \, p(\boldsymbol{\theta}) \right]

    Connection to regularization

    In linear regression, if we place a Gaussian prior on the parameters, \boldsymbol{\theta} \sim \mathcal{N}(\mathbf{0}, \alpha^2\mathbf{I}), the log-prior is \log p(\boldsymbol{\theta}) = -\frac{1}{2\alpha^2}|\boldsymbol{\theta}|^2 + \text{const}. Maximizing the log-posterior is equivalent to minimizing the negative log-posterior –

    \hat{\boldsymbol{\theta}}_{\text{MAP}} = \arg\min_{\boldsymbol{\theta}} \left[ \sum_{n=1}^N (y_n - \mathbf{x}_n^\top\boldsymbol{\theta})^2 + \lambda |\boldsymbol{\theta}|^2 \right] \quad (\text{where } \lambda \propto 1/\alpha^2)

    This is the objective function for Ridge Regression (L2 regularization).

    MAP estimation with a Gaussian prior is equivalent to L2 regularization.
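
    Here is a minimal sketch of that equivalence in code (my own illustration with synthetic data): the MAP estimate under a Gaussian prior is just the closed-form ridge regression solution, slightly shrunk towards zero relative to the MLE.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
true_theta = np.array([1.5, -2.0, 0.5])
y = X @ true_theta + rng.normal(scale=0.5, size=n)

lam = 1.0   # regularization strength, proportional to 1 / alpha^2 in the prior

# MAP with a Gaussian prior = ridge closed form: theta = (X^T X + lam * I)^{-1} X^T y
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# MLE (no prior) for comparison: ordinary least squares.
theta_mle = np.linalg.solve(X.T @ X, X.T @ y)

print(theta_map)   # shrunk slightly towards zero relative to theta_mle
print(theta_mle)
```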

    Is it perfect?

    Nope.

    MAP still only provides a point estimate (\hat{\boldsymbol{\theta}}_{\text{MAP}}). It finds the most probable parameters but discards all other information contained in the posterior distribution, such as its variance or shape.

    It doesn’t quantify our uncertainty about the parameters.

    Full Bayesian Estimation

    “Full Bayesian estimation embraces the entire philosophy of the Bayesian framework. It does not seek a single best point estimate for the parameters; instead, its goal is to compute and use the entire posterior distribution.”

    See? So good. Literally standing on the shoulders of giants all the time.

    Basics

    The primary goal is not parameter estimation itself, but making predictions.

    Bayesian estimation propagates the full parameter uncertainty through to the predictions. The output of the learning process is not a single value \hat{\boldsymbol{\theta}}, but the full probability distribution p(\boldsymbol{\theta}|\mathcal{D}).

    Theoretical development

    The Predictive Distribution: To make a prediction for a new data point \mathbf{x}_*, we do not use a single point estimate. Instead, we compute the posterior predictive distribution by averaging the predictions of all possible parameter values, weighted by their posterior probability:

    p(y_* | \mathbf{x}_*, \mathcal{D}) = \int p(y_* | \mathbf{x}_*, \boldsymbol{\theta}) \, p(\boldsymbol{\theta} | \mathcal{D}) \, d\boldsymbol{\theta}

    This can be written as an expectation:

    p(y_* | \mathbf{x}_*, \mathcal{D}) = \mathbb{E}_{\boldsymbol{\theta} \sim p(\boldsymbol{\theta}|\mathcal{D})} [p(y_* | \mathbf{x}_*, \boldsymbol{\theta})]

    The Procedure: The core task of full Bayesian estimation is integration, not optimization. We must solve the integral above, as well as the integral in the denominator of Bayes’ theorem (the marginal likelihood or evidence), p(\mathcal{D}) = \int p(\mathcal{D}|\boldsymbol{\theta})p(\boldsymbol{\theta})d\boldsymbol{\theta}.
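
    To see what "integration, not optimization" means in practice, here is a tiny Monte Carlo sketch (my own illustration) for a coin with a Beta posterior over its bias: the predictive probability of heads is obtained by averaging the per-\boldsymbol{\theta} prediction over posterior samples, and it matches the exact answer (here, the posterior mean).

```python
import numpy as np

rng = np.random.default_rng(0)

# Posterior over the coin bias after seeing 7 heads and 3 tails with a Beta(2, 2) prior:
# Beta(2 + 7, 2 + 3) = Beta(9, 5)  (standard Beta-Bernoulli conjugacy).
a, b = 9, 5

# Monte Carlo approximation of the posterior predictive p(y* = heads | data):
# draw many thetas from the posterior and average the per-theta predictions.
thetas = rng.beta(a, b, size=100_000)
p_heads_mc = np.mean(thetas)

p_heads_exact = a / (a + b)       # the integral has a closed form here: the posterior mean
print(p_heads_mc, p_heads_exact)  # both ~0.643
```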

    Summary

    | Method | MLE | MAP | Full Bayesian |
    | --- | --- | --- | --- |
    | Core Principle | Maximize the likelihood | Maximize the posterior | Use the entire posterior distribution |
    | View of Parameters \boldsymbol{\theta} | Fixed, unknown constant | Random variable (but we only find its mode) | Random variable (we use its full distribution) |
    | Core Task | Optimization | Optimization (regularized) | Integration |
    | Output of “Learning” | A point estimate, \hat{\boldsymbol{\theta}}_{\text{ML}} | A point estimate, \hat{\boldsymbol{\theta}}_{\text{MAP}} | A probability distribution, p(\boldsymbol{\theta} \mid \mathcal{D}) |
    | Handles Overfitting? | No (prone to overfitting) | Yes (via the prior, which acts as a regularizer) | Yes (by averaging over all parameters, it naturally avoids overfitting) |
    | Quantifies Uncertainty? | No | No | Yes (the primary advantage) |

    Sources

    1. Mathematics for Machine Learning (link)
    1. What’s there?
      1. Why do I write this?
    2. Gamma
      1. Foundations
      2. Theory
      3. Why is this cool?
    3. Chi-Squared (\chi^2)
      1. First principles
      2. Fundamental theorems of calculus
      3. As a special case of Gamma
      4. Where is this distribution used?
    4. Student’s t-
      1. Why does this exist?
      2. Theoretical development
      3. Where is it used?
    5. Sources

    What’s there?

    This one explores some more distributions – Gamma, Chi-Squared, and Student’s t-distributions. Why?

    • The Gamma distribution provides a flexible model for positive continuous random variables, such as waiting times.
    • The Chi-Squared and Student’s t-distributions are special cases and transformations derived from the Gamma and Normal distributions.

    Why do I write this?

    I wanted to create an authoritative document I could refer to for later. Thought that I might as well make it public (build in public, serve society and that sort of stuff).

    Gamma

    It is a continuous distribution defined on the positive real line. This just means that it’s defined for x > 0.

    Foundations

    We’ve described the gamma function in Uniform, Exponential, Beta & Dirichlet. As a recap, it is,

    \Gamma(z) = \int_0^\infty t^{z-1}e^{-t} dt

    Back to the topic, what are we even trying to do? What’s the intuitive goal?

    We want to model the waiting time until the \alpha-th event occurs in a Poisson process with a rate of \beta.

    Recall the Exponential distribution models the waiting time for the first event. The Gamma distribution generalizes this to the sum of waiting times for multiple events.

    Before everything, I just want to recap Poisson from Fundamental Discrete Probability Distributions.

    The Poisson distribution is a statistical tool used to model the probability of a certain number of events happening within a fixed interval of time or space, provided these events occur at a known constant average rate and independently of the time since the last event.

    For instance, “How many emails will I receive in the next hour?”

    P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}

    Where:

    • k is the number of events you are interested in (e.g., 5 phone calls).
    • λ (lambda) is the average number of events per interval (e.g., an average of 3 phone calls per hour).
    • e is Euler’s number (approximately 2.71828).
    • k! is the factorial of k.

    Theory

    Formal Definition: A continuous random variable X follows a Gamma distribution, denoted X \sim \text{Gamma}(\alpha, \beta), if its PDF is:

    f(x | \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1}e^{-\beta x}, \quad x > 0

    The distribution is governed by two parameters:

    1. Shape parameter (\alpha > 0): Controls the shape of the distribution. For \alpha \le 1, the density is monotonically decreasing. For \alpha > 1, it has a single peak. As \alpha \to \infty, the Gamma distribution approaches a Normal distribution.
    2. Rate parameter (\beta > 0): Controls the scale of the distribution (inverse of the scale parameter).
    • Moments:
      • Expectation: \mathbb{E}[X] = \alpha/\beta
      • Variance: \text{Var}(X) = \alpha/\beta^2
    • Relationship to other distributions:
      • If \alpha=1, the Gamma distribution becomes the Exponential distribution: \text{Gamma}(1, \beta) = \text{Exp}(\beta).
      • The sum of k independent \text{Exp}(\beta) variables is a \text{Gamma}(k, \beta) variable.
      • The Chi-Squared distribution is a special case of the Gamma distribution.

    Why is this cool?

    The sum of independent Gamma-distributed random variables with the same rate parameter is also a Gamma-distributed variable. Specifically, if X_i \sim \text{Gamma}(\alpha_i, \beta) are independent, then:

    \sum_{i=1}^k X_i \sim \text{Gamma}\left(\sum_{i=1}^k \alpha_i, \beta\right)

    The shape parameters add, while the rate parameter remains the same.
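
    A quick simulation sketch of this additive property (my own illustration): summing independent Gamma draws with a shared rate gives samples whose mean and variance match the predicted Gamma(sum of shapes, same rate).

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 2.0                       # shared rate parameter
alphas = [1.0, 2.5, 0.5]         # shape parameters of the summands

# NumPy's gamma sampler is parameterized by (shape, scale), with scale = 1 / rate.
samples = sum(rng.gamma(a, 1.0 / beta, size=200_000) for a in alphas)

alpha_total = sum(alphas)        # 4.0
print(samples.mean(), alpha_total / beta)        # both ~2.0   (mean = alpha / beta)
print(samples.var(),  alpha_total / beta**2)     # both ~1.0   (var  = alpha / beta^2)
```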

    Chi-Squared (\chi^2)

    First principles

    • The Core Definition: Let Z_1, Z_2, \dots, Z_k be k independent random variables, each following a standard normal distribution, Z_i \sim \mathcal{N}(0, 1). The Chi-Squared distribution with k degrees of freedom, denoted \chi^2_k, is the distribution of the sum of the squares of these variables:

    Q = \sum_{i=1}^k Z_i^2 \sim \chi^2_k

    • Degrees of Freedom (k): The single parameter of the distribution, k, is the degrees of freedom. It corresponds to the number of independent standard normal variables being summed. It dictates the shape of the distribution.

    Fundamental theorems of calculus

    First one

    If F is an antiderivative of a continuous function f on an interval [a, b], then:

    \int_a^b f(x) \,dx = F(b) - F(a)

    Second one

    If f is a continuous function on an interval, then for any a in that interval, the function defined by F(x) = \int_a^x f(t) \,dt is an antiderivative of f. That is:

    \frac{d}{dx} \int_a^x f(t) \,dt = f(x)

    Ok, cool, what if a was function of x? This is the Leibniz Integral Rule.

    If you have an integral of the form:

    F(x) = \int_{a(x)}^{b(x)} f(t) \,dt

    Its derivative with respect to x is:

    \frac{dF}{dx} = f(b(x)) \cdot b'(x) - f(a(x)) \cdot a'(x)

    As a special case of Gamma

    Step 1: Find the distribution of Y = Z^2 where Z \sim \mathcal{N}(0, 1).

    We can derive this using the change of variables technique. Let F_Y(y) be the CDF of Y. For y > 0:

    \begin{aligned} F_Y(y) &= P(Y \le y) \\ &= P(Z^2 \le y) \\ &= P(-\sqrt{y} \le Z \le \sqrt{y}) \end{aligned}

    This probability is the integral of the standard normal PDF, \phi(z) = \frac{1}{\sqrt{2\pi}}e^{-z^2/2}, between -\sqrt{y} and \sqrt{y}:

    F_Y(y) = \int_{-\sqrt{y}}^{\sqrt{y}} \frac{1}{\sqrt{2\pi}}e^{-z^2/2} dz

    We find the pdf using the Leibniz Integral rule –

    \begin{aligned} f_Y(y) = \frac{d}{dy} F_Y(y) &= \phi(\sqrt{y}) \cdot \frac{d}{dy}(\sqrt{y}) - \phi(-\sqrt{y}) \cdot \frac{d}{dy}(-\sqrt{y}) \\ &= \frac{1}{\sqrt{2\pi}}e^{-y/2} \cdot \frac{1}{2\sqrt{y}} - \frac{1}{\sqrt{2\pi}}e^{-y/2} \cdot \left(-\frac{1}{2\sqrt{y}}\right) \\ &= 2 \cdot \frac{1}{\sqrt{2\pi}}e^{-y/2} \cdot \frac{1}{2\sqrt{y}} \\ &= \frac{1}{\sqrt{2\pi}} y^{-1/2} e^{-y/2} \end{aligned}

    Now, we compare this PDF to the Gamma PDF, f_X(x; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1}e^{-\beta x}.
    We need to match the terms.

    • The exponential term e^{-y/2} suggests the rate parameter is \beta = 1/2.
    • The polynomial term y^{-1/2} suggests that the shape parameter is \alpha-1 = -1/2 \implies \alpha = 1/2.
      Let’s check if the normalization constant matches. For \text{Gamma}(1/2, 1/2), the constant is \frac{(1/2)^{1/2}}{\Gamma(1/2)}. It is a known property of the Gamma function that \Gamma(1/2) = \sqrt{\pi}. So the constant is \frac{1/\sqrt{2}}{\sqrt{\pi}} = \frac{1}{\sqrt{2\pi}}.
      This matches perfectly. Therefore, we have proven that the square of a single standard normal variable follows a Gamma distribution:

    Z^2 \sim \text{Gamma}\left(\alpha=\frac{1}{2}, \beta=\frac{1}{2}\right)

    A Chi-Squared distribution with one degree of freedom is identical to this Gamma distribution: \chi^2_1 = \text{Gamma}(1/2, 1/2).

    Step 2: Find the distribution of Q = \sum_{i=1}^k Z_i^2.

    Since each Z_i^2 is an independent \text{Gamma}(1/2, 1/2) random variable, we can use the additive property of the Gamma distribution. The sum of k such variables will be:

    Q = \sum_{i=1}^k Z_i^2 \sim \text{Gamma}\left(\sum_{i=1}^k \frac{1}{2}, \frac{1}{2}\right) = \text{Gamma}\left(\frac{k}{2}, \frac{1}{2}\right)

    By definition, Q \sim \chi^2_k. Thus, we have established the equivalence:

    \chi^2_k = \text{Gamma}\left(\alpha = \frac{k}{2}, \beta = \frac{1}{2}\right)
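
    This equivalence is easy to verify numerically (a sketch of my own; note that SciPy's Gamma uses a scale parameter, the inverse of the rate, so \beta = 1/2 corresponds to scale = 2):

```python
import numpy as np
from scipy import stats

k = 5
x = np.linspace(0.01, 20, 500)

chi2_pdf  = stats.chi2.pdf(x, df=k)
gamma_pdf = stats.gamma.pdf(x, a=k / 2, scale=2)   # shape k/2, rate 1/2 (scale = 1/rate = 2)

print(np.allclose(chi2_pdf, gamma_pdf))   # True: the two densities coincide
```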

    Where is this distribution used?

    Example 1: Testing the Variance of a Population (Goodness-of-Fit)

    • Scenario: A manufacturer produces bolts that are supposed to have a diameter variance of \sigma^2 = 0.01 \text{ mm}^2. To perform quality control, a sample of n=20 bolts is taken, and their sample variance is calculated to be s^2 = 0.015 \text{ mm}^2. Does this provide significant evidence that the true variance of the manufacturing process has increased?
    • Application: Under the null hypothesis that the population is normal and the true variance is indeed \sigma^2=0.01, the test statistic:

    \chi^2_{\text{stat}} = \frac{(n-1)s^2}{\sigma^2} = \frac{(19)(0.015)}{0.01} = 28.5

    follows a Chi-Squared distribution with n-1=19 degrees of freedom. We can then compare this value to the \chi^2_{19} distribution. We would calculate the probability P(\chi^2_{19} \ge 28.5). If this probability (the p-value) is very low (e.g., < 0.05), we would reject the null hypothesis and conclude that the manufacturing process variance has likely increased.
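
    The tail probability in this example is a one-liner with SciPy (a sketch, using the same numbers as above):

```python
from scipy import stats

chi2_stat, df = 28.5, 19
p_value = stats.chi2.sf(chi2_stat, df)   # P(chi^2_19 >= 28.5), the upper-tail probability
print(p_value)   # ~0.075, so with alpha = 0.05 this particular sample would not quite reject H_0
```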

    Example 2: Likelihood-Ratio Tests in Machine Learning

    • Scenario: In machine learning, we often want to compare two nested models: a simpler model (e.g., linear regression) and a more complex model that includes the simpler one as a special case (e.g., polynomial regression). We want to know if the additional complexity of the second model provides a statistically significant improvement in fit.
    • Application: Let L_0 be the maximum likelihood value for the simple model and L_1 be the maximum likelihood for the complex model. The likelihood-ratio test statistic is:

    D = -2 \log \left( \frac{L_0}{L_1} \right) = 2(\log L_1 - \log L_0)

    According to Wilks’s theorem, as the number of data points becomes large, this statistic D approximately follows a Chi-Squared distribution. The degrees of freedom k are equal to the difference in the number of free parameters between the two models. This provides a formal statistical test for model selection and is used to justify adding parameters to a model.

    We will discuss this later as well under statistics.

    Student’s t-

    Why does this exist?

    Z-statistic

    The formula Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} is a way to standardize the sample mean. It tells us how many standard errors our sample mean (\bar{X}) is away from the true population mean (\mu). If we know the true population standard deviation (\sigma), this Z-score will perfectly follow a standard normal distribution (a bell curve with a mean of 0 and a standard deviation of 1).

    This is wonderful because the properties of the normal distribution are well-understood, making it easy to calculate probabilities and test hypotheses.

    But, we know neither \sigma nor \mu.

    This is where the Student’s t-distribution comes in. It was developed specifically to solve this problem.

    Since we don’t know the population standard deviation (\sigma), we have to estimate it using the sample standard deviation (s). We then substitute s into the formula, creating a new quantity called the t-statistic:

    t = \frac{\bar{X}-\mu}{s/\sqrt{n}}

    To account for this added uncertainty, the t-statistic doesn’t follow a perfect normal distribution. Instead, it follows a t-distribution.

    Theoretical development

    Formal Definition: Let Z \sim \mathcal{N}(0, 1) be a standard normal random variable and V \sim \chi^2_k be a Chi-Squared random variable with k degrees of freedom. If Z and V are independent, then the random variable T defined as:

    T = \frac{Z}{\sqrt{V/k}}

    follows a Student’s t-distribution with k degrees of freedom, denoted T \sim t_k.

    Derivation of the t-statistic: In the context of the sample mean, we have Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} and we know that V = \frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}. Substituting these into the definition of T:

    T = \frac{\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}}{\sqrt{\frac{(n-1)s^2/\sigma^2}{n-1}}} = \frac{\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}}{\sqrt{s^2/\sigma^2}} = \frac{\bar{X}-\mu}{s/\sqrt{n}}

    This statistic, which uses the sample standard deviation s, follows a t-distribution with n-1 degrees of freedom.

    Importantly,

    • The t-distribution approaches the standard normal distribution as the degrees of freedom k \to \infty. This makes sense: as the sample size grows, our estimate s of the true standard deviation \sigma becomes very accurate, and the uncertainty from estimating it vanishes.
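
    You can watch this convergence through the critical values (my own quick sketch): the 97.5th percentile of the t-distribution shrinks towards the normal's 1.96 as the degrees of freedom grow.

```python
from scipy import stats

for df in [2, 5, 10, 30, 100, 1000]:
    print(df, stats.t.ppf(0.975, df))    # 4.30, 2.57, 2.23, 2.04, 1.98, 1.96

print("normal:", stats.norm.ppf(0.975))  # 1.959...
```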

    Where is it used?

    • Machine Learning (t-SNE): As discussed previously, the t-distribution with one degree of freedom is used in the low-dimensional space of t-SNE to create embeddings where clusters are more clearly separated. Its heavy tails are the key to this property.
    • Hint for Hypothesis Testing: The t-distribution is the basis for the Student’s t-test, one of the most widely used statistical tests. It is used to compare the means of two groups, or to test whether a sample mean is significantly different from a hypothesized value, when the population variance is unknown. The t-statistic \frac{\bar{X}-\mu}{s/\sqrt{n}} is calculated, and its value is compared to the critical values of the t-distribution to determine statistical significance. It is the go-to tool for inference on means with small sample sizes.

    Sources

    1. Mathematics for Machine Learning (link)

    1. What’s there?
      1. Why do I write this?
    2. Uniform
    3. Exponential
      1. Foundational insight
      2. Theoretical development
    4. Beta
      1. Introduction
      2. Foundations
      3. First principles
      4. Conjugacy
    5. Dirichlet
      1. First principles
      2. Conjugacy, the Dirichlet-Multinomial Model
    6. Sources

    What’s there?

    Covers these things –

    1. Uniform & Exponential distributions
    2. Beta & Dirichlet (distributions over probabilities, key for Bayesian inference)

    This should conclude stuff from continuous distributions that I wanted to talk about.

    Why do I write this?

    I wanted to create an authoritative document I could refer to for later. Thought that I might as well make it public (build in public, serve society and that sort of stuff).

    Uniform

    Wdym what’s uniform? We’ve already used it so many times.

    f(x | a, b) = \begin{cases} \frac{1}{b-a} & \text{if } a \le x \le b \\ 0 & \text{otherwise} \end{cases}

    F(x) = \begin{cases} 0 & \text{if } x < a \\ \frac{x-a}{b-a} & \text{if } a \le x \le b \\ 1 & \text{if } x > b \end{cases}

    Exponential

    The Exponential distribution is a continuous distribution that models the time between events in a Poisson process.

    Foundational insight

    Recall that a Poisson process describes events occurring independently at a constant average rate, \lambda. The Poisson distribution counts the number of events in a fixed interval.

    The Exponential distribution describes the waiting time for the very next event to occur.

    Ok? Cool, I also want to say that it is memoryless.

    The Exponential distribution is the only continuous distribution that is memoryless. This means that the probability of an event occurring in the next instant is independent of how much time has already elapsed. If you have been waiting for 10 minutes for a bus whose arrival follows an exponential distribution, the probability of it arriving in the next minute is exactly the same as it was when you first arrived. P(X > s+t | X > s) = P(X > t).
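
    A quick simulation sketch of the memoryless property (my own illustration): conditioning on having already waited s units of time does not change the distribution of the remaining wait.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, s, t = 0.5, 2.0, 1.0
x = rng.exponential(scale=1 / lam, size=1_000_000)   # NumPy uses scale = 1 / lambda

p_unconditional = np.mean(x > t)                     # P(X > t)
p_conditional   = np.mean(x[x > s] > s + t)          # P(X > s + t | X > s)

print(p_unconditional, p_conditional, np.exp(-lam * t))   # all three ~0.607
```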

    Theoretical development

    Formal Definition: A continuous random variable T follows an Exponential distribution with rate parameter \lambda > 0, denoted T \sim \text{Exp}(\lambda), if its PDF is:

    f(t | \lambda) = \lambda e^{-\lambda t} \mathbf{1}_{t \ge 0}

    The rate parameter \lambda is the same as the rate parameter in the underlying Poisson process (events per unit of time).

    F(t) = P(T \le t) = 1 - e^{-\lambda t} \quad \text{for } t \ge 0

    Expectation (Mean): \mathbb{E}[T] = 1/\lambda. This is intuitive: if events occur at a rate of \lambda=2 per hour, the average waiting time is 1/2 an hour.

    And, of course, the variance –

    \text{Var}(T) = 1/\lambda^2.

    Beta

    Ok, now we start the stuff that I really wanted to write this prose for.

    Introduction

    Beta and Dirichlet Distributions – They are not distributions over data itself, but distributions over the parameters of other distributions. This makes them essential tools for Bayesian inference.

    The Beta distribution is a continuous distribution defined on the interval [0, 1]. It is the canonical distribution for modeling uncertainty about a probability, such as the bias of a coin.

    Foundations

    We must talk about the Gamma Function.

    It’s essentially the continuous form of the factorial. It is defined as –

    \Gamma(z) = \int_0^\infty t^{z-1}e^{-t}dt

    The crucial connection between the Gamma function and the factorial is given by the identity for any positive integer n

    \Gamma(n) = (n-1)!

    Properties of the Gamma function

    • Recursion Formula: Similar to the factorial,

    \Gamma(z+1) = z\Gamma(z)

    • Domain and Poles: The Gamma function is defined for all complex numbers except for non-positive integers (0, -1, -2, …), where it has simple poles. This means the function goes to infinity at these points.
    • Special Values: While the Gamma function can be challenging to compute directly for many values, some special values are well-known. A particularly famous one is:

    \Gamma\left(\frac{1}{2}\right) = \sqrt{\pi}

    This is important and can be used further, for example, \Gamma\left(\frac{3}{2}\right) = \frac{1}{2}\Gamma\left(\frac{1}{2}\right) = \frac{\sqrt{\pi}}{2}.

    First principles

    Consider a Bernoulli trial (e.g., a coin flip) with an unknown probability of success, p. Before we flip the coin, we are uncertain about the value of p. The Beta distribution provides a way to express this uncertainty.

    It is a probability distribution over the parameter p.

    Formal Definition: A random variable P follows a Beta distribution, denoted P \sim \text{Beta}(\alpha, \beta), if its PDF is:

    f(p | \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} p^{\alpha-1}(1-p)^{\beta-1}, \quad p \in [0,1]

    where \alpha > 0 and \beta > 0 are its hyperparameters (parameters of a distribution over a parameter). The term with the Gamma function \Gamma(\cdot) is a normalization constant.

    Interpretation of Hyperparameters: The hyperparameters \alpha and \beta can be intuitively understood as “pseudo-counts” of successes and failures, respectively.

    1. \alpha-1 is the power of p (the probability of success).
    2. \beta-1 is the power of (1-p) (the probability of failure).
    3. \text{Beta}(1, 1) corresponds to a uniform distribution, representing complete prior uncertainty.
    4. \alpha > \beta pushes the probability mass towards 1.
    5. \beta > \alpha pushes the probability mass towards 0.
    6. Large \alpha and \beta create a distribution sharply peaked around its mean, representing strong prior belief.

    Conjugacy

    Conjugacy: In Bayesian inference, a prior distribution is conjugate to a likelihood if the resulting posterior distribution belongs to the same family of distributions as the prior.

    What? It means that if you choose a specific type of prior distribution (your initial belief) that “matches” your likelihood (the data you observe), your posterior distribution (your updated belief) will be the same type of distribution as your prior, just with updated parameters.

    This gets us to the Beta-Bernoulli Model

    1. Prior: We place a Beta prior on the unknown probability p: p \sim \text{Beta}(\alpha, \beta).
    2. Likelihood: We observe data from a Bernoulli or Binomial process. Suppose we observe s successes and f failures. The likelihood is proportional to p^s(1-p)^f.
    3. Posterior: According to Bayes’ theorem, the posterior is proportional to the likelihood times the prior:

    p(p | \text{data}) \propto p^s(1-p)^f \times p^{\alpha-1}(1-p)^{\beta-1} = p^{s+\alpha-1}(1-p)^{f+\beta-1}

    This is the kernel of another Beta distribution. The posterior is:

    p(p | \text{data}) = \text{Beta}(\alpha+s, \beta+f)

    Interpretation: The learning process is incredibly simple: we just add the observed counts of successes and failures to our prior pseudo-counts.
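
    In code, the whole "learning" step is just this addition (a minimal sketch with made-up counts):

```python
from scipy import stats

# Prior pseudo-counts: Beta(2, 2) encodes a mild belief that the coin is roughly fair.
alpha_prior, beta_prior = 2, 2

# Observed data: 7 successes (heads) and 3 failures (tails).
s, f = 7, 3

# Conjugate update: just add the observed counts to the pseudo-counts.
posterior = stats.beta(alpha_prior + s, beta_prior + f)   # Beta(9, 5)

print(posterior.mean())          # ~0.643, pulled from the prior mean 0.5 towards 7/10
print(posterior.interval(0.95))  # a 95% credible interval for p
```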

    Dirichlet

    The Dirichlet distribution is the multivariate generalization of the Beta distribution. It is a distribution over a probability vector.

    First principles

    • The Problem: Consider a Categorical or Multinomial trial with K possible outcomes and an unknown probability vector \mathbf{p} = [p_1, \dots, p_K]^\top, where \sum p_k = 1. The Dirichlet distribution models our uncertainty about this entire probability vector.
    • The Simplex: The domain of the Dirichlet distribution is the standard (K-1)-simplex, which is the set of K-dimensional vectors whose components are non-negative and sum to 1.
    • Formal Definition: A random vector \mathbf{P} follows a Dirichlet distribution, denoted \mathbf{P} \sim \text{Dir}(\boldsymbol{\alpha}), if its PDF is:

    f(\mathbf{p} | \boldsymbol{\alpha}) = \frac{\Gamma\left(\sum_{k=1}^K \alpha_k\right)}{\prod_{k=1}^K \Gamma(\alpha_k)} \prod_{k=1}^K p_k^{\alpha_k - 1}

    where the hyperparameter \boldsymbol{\alpha} = [\alpha_1, \dots, \alpha_K]^\top is a vector of positive real numbers, which can be interpreted as pseudo-counts for each of the K categories.

    Conjugacy, the Dirichlet-Multinomial Model

    1. Prior: \mathbf{p} \sim \text{Dir}(\boldsymbol{\alpha})
    2. Likelihood: We observe data from n trials, with counts \mathbf{x} = [x_1, \dots, x_K]^\top for each category. The likelihood is proportional to \prod p_k^{x_k}.
    3. Posterior:

    p(\mathbf{p} | \text{data}) \propto \left(\prod p_k^{x_k}\right) \left(\prod p_k^{\alpha_k-1}\right) = \prod p_k^{x_k+\alpha_k-1}

    The posterior is another Dirichlet distribution:

    p(\mathbf{p} | \text{data}) = \text{Dir}(\boldsymbol{\alpha} + \mathbf{x})

    As with the Beta, learning is a simple matter of adding the observed counts to the prior pseudo-counts.

    Sources

    1. Mathematics for Machine Learning (link)
    1. What’s there?
      1. Why do I write this?
    2. Box-Muller Transform
      1. Foundational insight
      2. Theoretical development
      3. Algorithm
      4. How to do this for any arbitrary thing?
    3. Rejection Sampling
      1. Setup
      2. Algorithm
      3. Interpretation
      4. More intuition
      5. Ok, why not accept all points with r < 1?
    4. Sources

    What’s there?

    While the Inverse Transform Sampling method provides a theoretical foundation for random sampling, its practical applicability is limited to distributions with an invertible Cumulative Distribution Function (CDF).

    So, we have other stuff!

    Why do I write this?

    I wanted to create an authoritative document I could refer to for later. Thought that I might as well make it public (build in public, serve society and that sort of stuff).

    Box-Muller Transform

    We can use this to generate \mathcal{N}(0, 1) from U(0, 1).

    Foundational insight

    Remember that the CDF of the Gaussian distribution, which involves the error function (\text{erf}), does not have a closed-form analytical inverse.

    What else did you expect? I will give more insights through theory – I don’t want to mention them directly here.

    So, I really just wrote this section to mention what is called the Jacobian determinant.

    Just remember that

    In probability, density is “probability per unit area.” When you switch coordinate systems, the definition of a “unit area” changes.

    We’ll get back to this.

    Theoretical development

    Let X \sim \mathcal{N}(0, 1) and Y \sim \mathcal{N}(0, 1) be two independent standard normal random variables. Their joint PDF is:

    f(x, y) = f(x)f(y) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2} \cdot \frac{1}{\sqrt{2\pi}}e^{-y^2/2} = \frac{1}{2\pi}e^{-(x^2+y^2)/2}

    The expression e^{-(x^2+y^2)/2} shows that the probability density depends only on the squared distance from the origin (x^2+y^2), meaning the distribution is perfectly circular and symmetric.

    Sounds familiar?

    Yes, we can use polar coordinates now! We can use x = r\cos\theta and y = r\sin\theta, so r^2 = x^2+y^2.

    We know that –

    • Cartesian Area: A small rectangular area element, dx\,dy, has the same size everywhere on the plane.
    • Polar Area: A small “polar patch,” dr\,d\theta, does not have a constant area. A patch with a large radius r is much bigger than a patch with a small radius, even if dr and d\theta are the same.

    The fundamental relationship is that the probability in a small patch must be equal in both systems:

    f(x, y) \,dx\,dy = f(r, \theta) \,dr\,d\theta
    f(r, \theta) = f(x, y) \cdot \frac{dx\,dy}{dr\,d\theta} = f(x, y) \cdot r

    This extra factor of r is the Jacobian determinant of the change of variables from Cartesian to polar coordinates. Using this, we get –

    f(r, \theta) = r e^{-r^2/2} \cdot \frac{1}{2\pi}

    Algorithm

    We see that the joint density factors into two parts: f(r, \theta) = f_R(r) f_\Theta(\theta), where:

    • f_\Theta(\theta) = \frac{1}{2\pi} for \theta \in [0, 2\pi). This is the PDF of a uniform distribution.
    • f_R(r) = r e^{-r^2/2} for r \ge 0. This is the PDF of a Rayleigh distribution.

    The Box-Muller algorithm exploits this by generating samples for R and \Theta and then transforming them back.

    1. Generate two independent uniform random variables u_1, u_2 \sim U(0, 1).
    2. Generate the angle \theta = 2\pi u_1. This is a direct sample from the uniform distribution for the angle.
    3. Generate the radius r. The squared radius, R^2, follows an exponential distribution with rate \lambda=1/2. We can use Inverse Transform Sampling to generate a sample for R^2 from u_2: r^2 = -2\ln(u_2). Thus, r = \sqrt{-2\ln(u_2)}.
    4. Convert back to Cartesian coordinates:

    x = r \cos\theta = \sqrt{-2\ln u_2} \cos(2\pi u_1)

    y = r \sin\theta = \sqrt{-2\ln u_2} \sin(2\pi u_1)

    The resulting x and y are two independent samples from a standard normal distribution.
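
    Here is the algorithm above as a direct sketch in NumPy (my own implementation of the stated steps):

```python
import numpy as np

def box_muller(n, rng=None):
    """Generate 2*n independent N(0, 1) samples from uniform random numbers."""
    rng = rng or np.random.default_rng()
    u1 = rng.uniform(size=n)
    u2 = rng.uniform(size=n)
    theta = 2 * np.pi * u1              # uniform angle on [0, 2*pi)
    r = np.sqrt(-2 * np.log(u2))        # radius via inverse transform of R^2 ~ Exp(1/2)
    x = r * np.cos(theta)
    y = r * np.sin(theta)
    return np.concatenate([x, y])

samples = box_muller(500_000, np.random.default_rng(0))
print(samples.mean(), samples.std())    # ~0.0 and ~1.0, as expected for N(0, 1)
```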

    How to do this for any arbitrary thing?

    To generate samples from an arbitrary Gaussian \mathcal{N}(\mu, \sigma^2), one simply generates a standard normal sample z using Box-Muller and then applies the linear transformation: x = \mu + \sigma z.

    Duh.

    Rejection Sampling

    This is when we want to generate samples from p(x), but cannot sample from it directly, but we can evaluate its density (up to a constant).

    It works by sampling from an easier-to-sample “proposal” distribution and then probabilistically accepting or rejecting those samples.

    Setup

    1. Target Distribution, p(x): The complex distribution we want to sample from. We only need to be able to evaluate a function \tilde{p}(x) that is proportional to p(x), i.e., p(x) = \tilde{p}(x)/Z where the normalization constant Z may be unknown.
    2. Proposal Distribution, q(x): A simpler distribution that we already know how to sample from (e.g., a uniform or Gaussian).
    3. Acceptance Constant, M: A constant chosen such that M \cdot q(x) \ge \tilde{p}(x) for all x. The function M q(x) must form an “envelope” that is everywhere above our unnormalized target density.

    Algorithm

    1. Propose: Draw a sample x_0 from the proposal distribution, x_0 \sim q(x).
    2. Evaluate: Calculate the acceptance ratio: r = \frac{\tilde{p}(x_0)}{M q(x_0)}. This ratio is guaranteed to be in [0, 1].
    3. Accept/Reject: Draw a sample u from the standard uniform distribution, u \sim U(0, 1).
      • If u < r, accept the sample: x = x_0.
      • If u \ge r, reject the sample and return to Step 1.
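A tiny NumPy sketch of these three steps, assuming a toy unnormalized target \tilde{p}(x) = e^{-x^4} (my own choice) and a standard normal proposal; M = 2.7 is picked to sit just above \max_x \tilde{p}(x)/q(x) \approx 2.67:

```python
import numpy as np

rng = np.random.default_rng(0)

p_tilde = lambda x: np.exp(-x**4)                           # unnormalized target (toy choice)
q_pdf = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)    # standard normal proposal density
M = 2.7                                                     # envelope: M*q(x) >= p_tilde(x) everywhere

def rejection_sample(n):
    samples = []
    while len(samples) < n:
        x0 = rng.standard_normal()                 # 1. propose from q
        r = p_tilde(x0) / (M * q_pdf(x0))          # 2. acceptance ratio in [0, 1]
        if rng.uniform() < r:                      # 3. accept with probability r
            samples.append(x0)
    return np.array(samples)

xs = rejection_sample(10_000)
print(xs.mean(), xs.std())                         # the target is symmetric, so mean ~ 0
```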

    Interpretation

    Yeah, what is even happening here?

    Imagine the graphs of \tilde{p}(x) and the envelope M q(x). The algorithm is equivalent to:

    1. Sample a point (x_0, y_0) uniformly from the area under the curve of the envelope M q(x).
    2. If this point is also under the curve of the target density \tilde{p}(x), accept the x-coordinate x_0. Otherwise, reject it.
      This process correctly generates samples with a distribution of p(x).

    Still don’t get it? Yeah, I didn’t either.

    More intuition

    For the dummies out there (like me!).

    Imagine you want to sample random points from within the shape of an oddly-shaped pond (your complex target distribution), but you can only throw darts randomly at a large, rectangular tarp that covers the entire pond (your simple proposal distribution).

    1. Throw a Dart: You throw a dart, and it lands somewhere on the tarp. This is Step 1: Propose.
    2. Check Where it Landed: You look at the dart’s location.
    3. Accept or Reject:
      • If the dart landed in the pond, you accept its location as a valid sample.
      • If the dart landed on the tarp but outside the pond, you reject it and throw another dart.

    If you do this many times, the collection of accepted dart locations will be a perfect random sample from within the shape of the pond.

This flat pond-and-tarp picture is just the most basic version, not the actual algorithm. In the real algorithm, the curves have varying heights, so the picture needs one more ingredient.

    • Propose (Throw a horizontal dart): When you draw a sample x_{prop} from q(x), you are choosing a random horizontal position on the dartboard.
    • Accept/Reject (Throw a vertical dart): Now, at that horizontal position, you need to decide whether to keep the sample. This is done by throwing a “vertical” dart.
      1. The height of the dartboard at this position is M \cdot q(x_{prop}).
      2. The height of your target shape is \tilde{p}(x_{prop}).
      3. The acceptance ratio r = \frac{\tilde{p}(x_{prop})}{M \cdot q(x_{prop})} tells you how much of the dartboard’s height is “filled” by the target shape at that exact spot.
      4. You then draw a random number u between 0 and 1. If u is less than or equal to the ratio r, it means your “vertical dart” has landed underneath the curve of the target shape. You accept the sample. If not, it landed above the target but still on the dartboard, so you reject it.

    This process ensures that you are more likely to accept samples in regions where the target distribution \tilde{p}(x) is high and less likely to accept them where it’s low. Over many trials, the collection of accepted samples perfectly mimics the shape of the target distribution you wanted all along.

Ok, why not just accept every proposed point? After all, the acceptance ratio r is always at most 1, so every proposal sits under the envelope anyway.

The proposal distribution, q(x), generates samples according to its own shape. For example, if q(x) is a uniform distribution, it will propose samples evenly across its domain. If we simply accepted every sample we generated, our final collection of points would just be a uniform distribution.

    The random comparison, u \le r, is a clever trick to “thin out” the samples from the proposal distribution so that their final density matches the target.

Think of the acceptance ratio, r = \frac{\tilde{p}(x)}{M \cdot q(x)}, as the percentage of the envelope’s height that is “filled” by the target curve at a specific point x.

    • Where the Target is High: In a region where \tilde{p}(x) is high (a peak), the ratio r will be close to 1. This means you have a very high chance of accepting a sample proposed in this region.
    • Where the Target is Low: In a region where \tilde{p}(x) is low (a valley), the ratio r will be close to 0. This means you have a very low chance of accepting a sample proposed here, and you’ll reject most of them.

    This probabilistic step ensures that you keep more samples at the peaks and fewer samples in the valleys, perfectly sculpting the simple proposal distribution into the shape of your desired target distribution.

    Makes sense? Yay!

    Sources

    1. Mathematics for Machine Learning (link)

    1. What’s there?
      1. Why do I write this?
    2. Inverse Transform Sampling
      1. Foundations
      2. Intuition
      3. Theoretical development
      4. Algorithm
      5. When does this not work?
    3. Sums and Products of Random Variables
      1. Sum of independent RVs
      2. Product of independent RVs
    4. Sources

    What’s there?

An exposition (I love this word sm) of the techniques of Inverse Transform Sampling and the algebra of random variables.

    Why do I write this?

    I wanted to create an authoritative document I could refer to for later. Thought that I might as well make it public (build in public, serve society and that sort of stuff).

    Inverse Transform Sampling

    This is how you generate random samples from any probability distribution.

    Foundations

    Y’know the standard uniform distribution, U(0, 1). This is a continuous distribution where every value in the interval [0, 1] is equally likely.

    We assume the existence of a pseudorandom number generator (PRNG) that can produce such samples.

    Imma also just recap the properties of the CDF (F_X(x) = P(X \le x))

    1. It is a non-decreasing function.
    2. It is right-continuous.
3. \lim_{x \to -\infty} F_X(x) = 0 and \lim_{x \to \infty} F_X(x) = 1. Its range is therefore the interval [0, 1].

    Intuition

    A simple observation – if we apply the CDF of a random variable X to the variable itself, the resulting random variable Y = F_X(X) follows a standard uniform distribution, Y \sim U(0,1). The inverse of this statement is the key to the sampling method: if we start with a uniform random variable and apply the inverse CDF, we will generate a random variable with the desired distribution.

    So cool.

    Theoretical development

    Theorem (Probability Integral Transform): Let X be a continuous random variable with a strictly monotonic CDF, F_X(x). Then the random variable Y = F_X(X) is uniformly distributed on [0, 1].

    Proof: We want to find the CDF of Y, denoted F_Y(y) = P(Y \le y) for y \in [0, 1].

\begin{aligned} F_Y(y) &= P(Y \le y) \\ &= P(F_X(X) \le y) \quad (\text{by definition of } Y) \end{aligned}

    Since F_X is a strictly increasing function, its inverse F_X^{-1} exists. We can apply it to both sides of the inequality inside the probability expression without changing the direction of the inequality:

\begin{aligned} F_Y(y) &= P(F_X^{-1}(F_X(X)) \le F_X^{-1}(y)) \\ &= P(X \le F_X^{-1}(y)) \end{aligned}

    By the definition of the CDF of X, P(X \le x') = F_X(x'). Therefore:

    F_Y(y) = F_X(F_X^{-1}(y)) = y

    The CDF of the random variable Y is F_Y(y) = y for y \in [0, 1]. This is the CDF of the standard uniform distribution, U(0, 1).

    Importantly, note the strictly monotonic CDF bit!

    Algorithm

    1. Generate a uniform sample: Draw a random number u from the standard uniform distribution, u \sim U(0, 1).
    2. Apply the inverse CDF: Compute the value x = F_X^{-1}(u).
    3. Result: The resulting value x is a random sample from the distribution with CDF F_X(x).
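A quick sketch, using the Exponential(\lambda) distribution as the target (my own choice of example) since its CDF F(x) = 1 - e^{-\lambda x} has a closed-form inverse, F^{-1}(u) = -\ln(1-u)/\lambda:

```python
import numpy as np

rng = np.random.default_rng(42)
lam = 2.0

u = rng.uniform(size=100_000)        # step 1: draw u ~ U(0, 1)
x = -np.log(1.0 - u) / lam           # step 2: apply the inverse CDF F^{-1}(u)

print(x.mean(), 1 / lam)             # both ~ 0.5 (the Exponential mean)
print(x.var(), 1 / lam**2)           # both ~ 0.25 (the Exponential variance)
```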

    When does this not work?

This requires the CDF to be invertible in closed form. What if it isn’t? Example – Gaussians (the most predominant case) have no closed-form inverse CDF. Update – Advanced Methods for generating RVs

    Sums and Products of Random Variables

    What do we need to do?

    Let X and Y be two random variables with known distributions. We define a new random variable, Z, as a function of them, for example Z = X+Y or Z=XY. The goal is to find the probability distribution of Z.

    Sum of independent RVs

    To find the distribution of the sum, we first find its CDF, F_Z(z) = P(Z \le z) = P(X+Y \le z). This probability is the double integral of the joint density f(x, y) over the region where x+y \le z.

    F_Z(z) = \iint_{x+y \le z} f(x, y) dx dy

Wut. How can we even compute this? Yeah, dw, we use independence and say that their joint density is the product of their marginal densities, f(x, y) = f_X(x)f_Y(y).

We also differentiate, since the PDF is the derivative of the CDF (w.r.t. z). This gives us the convolution formula!

    f_Z(z) = \int_{-\infty}^{\infty} f_X(x) f_Y(z-x) dx = (f_X * f_Y)(z)

(I derive the product case properly below to give a better understanding, btw.)

    Important Special Cases:

    • Sum of Gaussians: The sum of two independent Gaussian random variables is also a Gaussian. If X \sim \mathcal{N}(\mu_X, \sigma_X^2) and Y \sim \mathcal{N}(\mu_Y, \sigma_Y^2), then Z=X+Y \sim \mathcal{N}(\mu_X+\mu_Y, \sigma_X^2+\sigma_Y^2). This is a rare case where the convolution has a simple, closed-form solution.
    • Sum of Poissons: The sum of two independent Poisson random variables is also a Poisson. If X \sim \text{Poisson}(\lambda_X) and Y \sim \text{Poisson}(\lambda_Y), then Z=X+Y \sim \text{Poisson}(\lambda_X+\lambda_Y).
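Both special cases are easy to sanity-check by simulation (the particular parameters below are my own toy choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Sum of Gaussians: N(1, 2^2) + N(-3, 1^2) should be N(-2, 5)
z = rng.normal(1, 2, n) + rng.normal(-3, 1, n)
print(z.mean(), z.var())        # ~ -2 and ~ 5

# Sum of Poissons: Poisson(2) + Poisson(3) should be Poisson(5)
k = rng.poisson(2, n) + rng.poisson(3, n)
print(k.mean(), k.var())        # both ~ 5 (mean = variance for a Poisson)
```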

    Product of independent RVs

Define the product

Let Z = XY, where X and Y are independent with densities f_X and f_Y. The CDF of Z, denoted F_Z(z), is the probability that Z is less than or equal to some value z.

F_Z(z) = P(Z \le z) = P(XY \le z)


    Since X and Y are independent, we can express this probability as a double integral of their joint density, f(x, y) = f_X(x)f_Y(y), over the region where xy \le z.

    F_Z(z) = \iint_{xy \le z} f_X(x) f_Y(y) \,dx\,dy

To make this integral solvable, we need to set explicit limits. Dividing xy \le z by x flips the direction of the inequality depending on the sign of x, so we must split the integral into two parts: one for x > 0 and one for x < 0.

    Define the CDF properly

    1. When x > 0, the condition is y \le z/x.
    2. When x < 0, the condition is y \ge z/x.

    This allows us to rewrite the CDF as:

    F_Z(z) = \int_{0}^{\infty} f_X(x) \left( \int_{-\infty}^{z/x} f_Y(y) \,dy \right) dx + \int_{-\infty}^{0} f_X(x) \left( \int_{z/x}^{\infty} f_Y(y) \,dy \right) dx

    We can express the inner integrals in terms of the CDF of Y, which is F_Y(y) = \int_{-\infty}^{y} f_Y(t) \,dt.

    F_Z(z) = \int_{0}^{\infty} f_X(x) F_Y(z/x) \,dx + \int_{-\infty}^{0} f_X(x) [1 - F_Y(z/x)] \,dx

    Find the PDF

    The PDF, f_Z(z), is the derivative of the CDF with respect to z. We can differentiate under the integral sign using the Leibniz integral rule.

    f_Z(z) = \frac{d}{dz}F_Z(z) = \frac{d}{dz}\int_{0}^{\infty} f_X(x) F_Y(z/x) \,dx + \frac{d}{dz}\int_{-\infty}^{0} f_X(x) [1 - F_Y(z/x)] \,dx

    f_Z(z) = \int_{0}^{\infty} f_X(x) \frac{d}{dz}F_Y(z/x) \,dx + \int_{-\infty}^{0} f_X(x) \frac{d}{dz}[1 - F_Y(z/x)] \,dx

    Using the chain rule, the derivative of F_Y(z/x) with respect to z is:

    \frac{d}{dz}F_Y(z/x) = f_Y(z/x) \cdot \frac{1}{x}

    Substituting this back into our expression:

    f_Z(z) = \int_{0}^{\infty} f_X(x) f_Y(z/x) \frac{1}{x} \,dx + \int_{-\infty}^{0} f_X(x) \left(-f_Y(z/x) \frac{1}{x}\right) \,dx

    f_Z(z) = \int_{0}^{\infty} f_X(x) f_Y(z/x) \frac{1}{x} \,dx + \int_{-\infty}^{0} f_X(x) f_Y(z/x) \frac{-1}{x} \,dx

    For the first integral, x > 0, so \frac{1}{x} = \frac{1}{|x|}. For the second integral, x < 0, so \frac{-1}{x} = \frac{1}{|x|}. Since both integrands are now identical, we can combine the two integrals back into one:

    f_Z(z) = \int_{-\infty}^{\infty} f_X(x) f_Y(z/x) \frac{1}{|x|} dx
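To convince myself the formula is right, here's a sketch that evaluates it numerically for the product of two standard normals (my own choice of example) and compares against a Monte Carlo histogram:

```python
import numpy as np
from scipy import integrate, stats

rng = np.random.default_rng(0)
phi = stats.norm.pdf                                  # standard normal density

def f_Z(z):
    """Density of Z = XY for independent standard normals, via the integral above."""
    def integrand(x):
        return 0.0 if x == 0.0 else phi(x) * phi(z / x) / abs(x)
    left, _ = integrate.quad(integrand, -np.inf, 0)
    right, _ = integrate.quad(integrand, 0, np.inf)
    return left + right

# Monte Carlo check: histogram of sampled products vs. the analytical integral
z_samples = rng.standard_normal(1_000_000) * rng.standard_normal(1_000_000)
hist, edges = np.histogram(z_samples, bins=np.linspace(-3, 3, 61), density=True)
mids = 0.5 * (edges[:-1] + edges[1:])
for z in (0.5, 1.0, 2.0):
    print(z, f_Z(z), hist[np.argmin(np.abs(mids - z))])   # the two numbers should roughly agree
```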

    Sources

    1. Mathematics for Machine Learning (link)
    2. Wikipedia page on Leibniz integral rule (link)
    1. What’s there?
      1. Why do I write this?
    2. Why does this exist?
      1. From Multivariate Gaussians to Gaussian Processes
      2. Defining a GP
    3. Bayesian Inference with GPs
      1. Gaussian Process Regression Model
      2. Posterior Predictive Distribution
    4. How tos
      1. The Kernel Function and Model Properties
      2. Learning the Hyperparameters
    5. Sources

    What’s there?

    Started with some Probability and Statistics; doing Gaussians now. This post is to go deep into Gaussian Processes.

    Why do I write this?

    I wanted to create an authoritative document I could refer to for later. Thought that I might as well make it public (build in public, serve society and that sort of stuff).

    Why does this exist?

    A parametric model, such as polynomial regression (f(x) = \sum_{i=0}^M \theta_i x^i), makes a strong, fixed assumption about the functional form of the relationship being modeled.

    The learning process is confined to finding the optimal parameters \boldsymbol{\theta} within this pre-defined structure.

    The fundamental question that motivates Gaussian Processes is – Can we perform inference about an unknown function without first committing to a rigid parametric form?

    From Multivariate Gaussians to Gaussian Processes

    1. A univariate Gaussian is a distribution over a scalar random variable.
    2. A multivariate Gaussian is a distribution over a vector of random variables, \mathbf{x} = [X_1, \dots, X_D]^\top. It is completely specified by a mean vector \boldsymbol{\mu} and a covariance matrix \mathbf{\Sigma}. The covariance matrix describes the relationships between all pairs of variables in the vector.
    3. A Gaussian Process is the logical extension of this concept to an infinite-dimensional setting. It is a distribution over a function f(\mathbf{x}).

    Defining a GP

    A Gaussian Process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

A GP defines a probability distribution over a function f: \mathcal{X} \to \mathbb{R}. For any finite set of input points \{\mathbf{x}_1, \dots, \mathbf{x}_N\}, the corresponding vector of function values [f(\mathbf{x}_1), \dots, f(\mathbf{x}_N)]^\top is a random vector that follows a multivariate Gaussian distribution.

    A Gaussian Process is completely specified by two functions:

    1. The Mean Function, m(\mathbf{x}): This function defines the expected value of the function at any input point \mathbf{x}.

    m(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})]

    It represents our prior belief about the average shape of the function. For notational simplicity, the mean function is often assumed to be the zero function, m(\mathbf{x}) = 0.

2. The Covariance Function (or Kernel), k(\mathbf{x}, \mathbf{x}'): This function defines the covariance between the function values at any two input points, \mathbf{x} and \mathbf{x}'.

    k(\mathbf{x}, \mathbf{x}') = \mathbb{E}[(f(\mathbf{x}) - m(\mathbf{x}))(f(\mathbf{x}') - m(\mathbf{x}'))]

The kernel encodes our prior beliefs about the properties of the function, such as its smoothness, periodicity, or stationarity. The choice of kernel is the most critical modeling decision when using a GP. One constraint, though: a valid kernel function must ensure that the covariance matrix it generates for any set of points is always positive semi-definite.

    We denote a Gaussian Process prior over a function as:

    f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))

    Huh?

    You sample a function from a GP, duh.

    Bayesian Inference with GPs

    This is why I even started this. The primary use of Gaussian Processes is for Bayesian regression.

    Gaussian Process Regression Model

    The full generative model assumes that our observed targets y are evaluations of the latent function f corrupted by independent, identically distributed Gaussian noise:

    1. Prior over the latent function: f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))
    2. Likelihood of observations: y_n = f(\mathbf{x}_n) + \epsilon_n, where \epsilon_n \sim \mathcal{N}(0, \sigma_n^2). This is equivalent to the likelihood p(y_n | f(\mathbf{x}_n)) = \mathcal{N}(y_n | f(\mathbf{x}_n), \sigma_n^2).

    Such cool notation, right?

    Posterior Predictive Distribution

    Let us consider a training dataset \mathcal{D} = \{\mathbf{X}, \mathbf{y}\}, where \mathbf{X} = \{\mathbf{x}_n\}_{n=1}^N are the training inputs and \mathbf{y} = \{y_n\}_{n=1}^N are the noisy training targets.

As with any regression, we want to predict the function value f_* = f(\mathbf{x}_*) at a new test point \mathbf{x}_*.

    Here’s how we do it –

    The core of GP inference lies in the foundational definition: any finite collection of function values is jointly Gaussian.

    Therefore, the training outputs \mathbf{y} and the test output f_* are jointly Gaussian –

    \begin{bmatrix} \mathbf{y} \\ f_* \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mathbf{m}(\mathbf{X}) \\ m(\mathbf{x}_*) \end{bmatrix}, \begin{bmatrix} \mathbf{K}(\mathbf{X}, \mathbf{X}) + \sigma_n^2\mathbf{I} & \mathbf{k}(\mathbf{X}, \mathbf{x}_*) \\ \mathbf{k}(\mathbf{x}_*, \mathbf{X}) & k(\mathbf{x}_*, \mathbf{x}_*) \end{bmatrix} \right)

    where:

    • \mathbf{m}(\mathbf{X}) is the vector of prior means at the training points.
    • \mathbf{K}(\mathbf{X}, \mathbf{X}) is the N \times N covariance matrix where K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j). The term \sigma_n^2\mathbf{I} is added to account for the independent observation noise.
    • \mathbf{k}(\mathbf{X}, \mathbf{x}_*) is the N \times 1 vector of covariances between the training points and the test point.
    • k(\mathbf{x}_*, \mathbf{x}_*) is the prior variance at the test point.

    Notice how we now have a joint Gaussian distribution of the form p(\mathbf{y}, f_*).

    The goal is to find the posterior predictive distribution p(f_* | \mathbf{y}, \mathbf{X}, \mathbf{x}_*).

Here, I just state the analytical solution (it drops out of the Gaussian conditioning formulas) –

\mu_* = m(\mathbf{x}_*) + \mathbf{k}(\mathbf{x}_*, \mathbf{X})\left[\mathbf{K}(\mathbf{X}, \mathbf{X}) + \sigma_n^2\mathbf{I}\right]^{-1}(\mathbf{y} - \mathbf{m}(\mathbf{X}))

\sigma_*^2 = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}(\mathbf{x}_*, \mathbf{X})\left[\mathbf{K}(\mathbf{X}, \mathbf{X}) + \sigma_n^2\mathbf{I}\right]^{-1}\mathbf{k}(\mathbf{X}, \mathbf{x}_*)

so that p(f_* | \mathbf{y}, \mathbf{X}, \mathbf{x}_*) = \mathcal{N}(f_* | \mu_*, \sigma_*^2).

    Interpretation:

    • The posterior mean is a linear combination of the observed training targets (adjusted by their prior means).
    • The posterior variance represents our uncertainty about the function value at the test point. It is the prior variance at that point, reduced by an amount that reflects the information gained from the training data.
      The variance is lowest near the training points and grows as we move away from them, correctly capturing that our uncertainty increases in regions where we have no data.
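Here's a small NumPy sketch of these equations for a zero-mean prior with an RBF kernel (the kernel, hyperparameter values, and the sine toy data are my own choices; the maths is just the formulas above):

```python
import numpy as np

def rbf_kernel(A, B, sigma_f=1.0, length=1.0):
    """Squared-exponential kernel matrix between the row-vectors in A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return sigma_f**2 * np.exp(-0.5 * sq / length**2)

def gp_posterior(X_train, y_train, X_test, sigma_n=0.1, **kern):
    """Posterior predictive mean and variance at X_test (zero prior mean assumed)."""
    K = rbf_kernel(X_train, X_train, **kern) + sigma_n**2 * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test, **kern)           # k(X, x_*)
    K_ss = rbf_kernel(X_test, X_test, **kern)           # k(x_*, x_*)
    mean = K_s.T @ np.linalg.solve(K, y_train)          # posterior mean
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)        # posterior covariance
    return mean, np.diag(cov)

# Tiny demo: noisy sine observations
rng = np.random.default_rng(1)
X = np.linspace(0, 5, 8)[:, None]
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(8)
Xs = np.linspace(0, 5, 50)[:, None]
mu, var = gp_posterior(X, y, Xs)
print(mu[:3], var[:3])    # variance shrinks near the training inputs, grows away from them
```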

    How tos

    The Kernel Function and Model Properties

    The choice of kernel is the primary mechanism for incorporating prior knowledge into the model. A widely used kernel is the Squared Exponential (or Radial Basis Function) kernel:

k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{1}{2l^2} \|\mathbf{x} - \mathbf{x}'\|^2\right)

    This kernel is governed by two hyperparameters:

    • Length-scale (l): Controls how quickly the correlation between function values decays with distance. A small l produces rapidly varying (“wiggly”) functions, while a large l produces smooth functions.
    • Signal Variance (\sigma_f^2): Controls the overall vertical variation of the function from its mean.

    Learning the Hyperparameters

    The kernel parameters and the noise variance \sigma_n^2 are typically not set by hand. They are learned from the data by maximizing the marginal log-likelihood:

    \log p(\mathbf{y} | \mathbf{X}, \boldsymbol{\theta}) = -\frac{1}{2}\mathbf{y}^\top (\mathbf{K}_{\boldsymbol{\theta}} + \sigma_n^2\mathbf{I})^{-1}\mathbf{y} - \frac{1}{2}\log|\mathbf{K}_{\boldsymbol{\theta}} + \sigma_n^2\mathbf{I}| - \frac{N}{2}\log(2\pi)

    where \boldsymbol{\theta} denotes the set of all hyperparameters. This objective function has a natural interpretation as an implementation of Occam’s razor.

    The first term is a data-fit term, while the second term, -\frac{1}{2}\log|\mathbf{K}_{\boldsymbol{\theta}} + \sigma_n^2\mathbf{I}|, is a complexity penalty term that penalizes overly complex models.
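A sketch of this objective (Cholesky-based, since directly inverting \mathbf{K}_{\boldsymbol{\theta}} + \sigma_n^2\mathbf{I} is numerically dicey; the function name is mine, and in practice you'd hand this, plus its gradients, to an optimizer):

```python
import numpy as np

def log_marginal_likelihood(K, y, sigma_n):
    """log p(y | X, theta) for a zero-mean GP.  K is the kernel matrix K_theta(X, X)."""
    N = len(y)
    Ky = K + sigma_n**2 * np.eye(N)
    L = np.linalg.cholesky(Ky)                           # Ky = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # Ky^{-1} y
    data_fit = -0.5 * y @ alpha                          # -1/2 y^T Ky^{-1} y
    complexity = -np.sum(np.log(np.diag(L)))             # -1/2 log|Ky|
    return data_fit + complexity - 0.5 * N * np.log(2 * np.pi)

# e.g. scan the RBF length-scale (rbf_kernel from the previous sketch) and keep the best:
# for l in (0.3, 1.0, 3.0):
#     print(l, log_marginal_likelihood(rbf_kernel(X, X, length=l), y, sigma_n=0.1))
```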

    Sources

    1. Mathematics for Machine Learning (link)
    1. What’s there?
      1. Why do I write this?
    2. Introduction
      1. Definition of a Univariate Gaussian
      2. Properties
      3. 68-95-99.7 Empirical Rule
    3. Stability under linear transformation
    4. Now, more introductions
      1. The Multivariate Gaussian Distribution
      2. The Geometry of the MVN
      3. Partitioning the vector
    5. Marginals and Conditions
      1. Marginals
      2. Conditionals
    6. Product of Gaussian Densities
      1. Bayesian Inference Context
      2. What this is
    7. Sums and Linear Transformations of Gaussians
      1. Setting
      2. Vineeth tells you what’s up
      3. Special Case: Sum of Independent Gaussians
    8. Sources

    What’s there?

    Started with some Probability and Statistics; doing Gaussians now. This is an info dump.

    Why do I write this?

    I wanted to create an authoritative document I could refer to for later. Thought that I might as well make it public (build in public, serve society and that sort of stuff).

    Introduction

    Definition of a Univariate Gaussian

    A random variable X is said to follow a Gaussian distribution if its PDF is given by the specific functional form:

    p(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right)

    Properties

    • Symmetry: The PDF is perfectly symmetric about its mean, \mu. This implies that the mean, median, and mode of the distribution are all equal.
    • The Normalization Constant: The term \frac{1}{\sqrt{2\pi\sigma^2}} is the normalization constant. It ensures that the total area under the curve integrates to 1, as required for any valid PDF:

    \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right) dx = 1

    • The Standard Normal Distribution: A special case of the Gaussian with \mu=0 and \sigma^2=1 is called the standard normal distribution, denoted \mathcal{N}(0, 1). Any Gaussian variable X \sim \mathcal{N}(\mu, \sigma^2) can be transformed into a standard normal variable Z via standardization:

    Z = \frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1)

    68-95-99.7 Empirical Rule

    I’m just adding this just like that.

    • One Standard Deviation (\mu \pm \sigma): Approximately 68.27% of the area under the curve lies within one standard deviation of the mean.

    P(\mu - \sigma \le X \le \mu + \sigma) \approx 0.68

    • Two Standard Deviations (\mu \pm 2\sigma): Approximately 95.45% of the area under the curve lies within two standard deviations of the mean.

    P(\mu - 2\sigma \le X \le \mu + 2\sigma) \approx 0.95

    • Three Standard Deviations (\mu \pm 3\sigma): Approximately 99.73% of the area under the curve lies within three standard deviations of the mean.

    P(\mu - 3\sigma \le X \le \mu + 3\sigma) \approx 0.997
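These three numbers are just values of the standard normal CDF; a quick check:

```python
from scipy.stats import norm

# P(mu - k*sigma <= X <= mu + k*sigma) for a Gaussian, k = 1, 2, 3
for k in (1, 2, 3):
    print(k, norm.cdf(k) - norm.cdf(-k))   # 0.6827, 0.9545, 0.9973
```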

    Stability under linear transformation

    The Gaussian distribution is stable under affine transformations. If X \sim \mathcal{N}(\mu, \sigma^2) and Y = aX + b for scalars a and b, then Y is also a Gaussian random variable. Its mean and variance are:

    • \mathbb{E}[Y] = \mathbb{E}[aX+b] = a\mathbb{E}[X]+b = a\mu + b
    • \text{Var}(Y) = \text{Var}(aX+b) = a^2\text{Var}(X) = a^2\sigma^2
      Therefore, Y \sim \mathcal{N}(a\mu+b, a^2\sigma^2).

    Also, the sum of two independent Gaussian random variables is a Gaussian random variable.

    Now, more introductions

    The Multivariate Gaussian Distribution

    A random vector \mathbf{X} \in \mathbb{R}^D is said to have a multivariate Gaussian distribution if its Probability Density Function (PDF) is given by:

    p(\mathbf{x} | \boldsymbol{\mu}, \mathbf{\Sigma}) = \frac{1}{(2\pi)^{D/2}|\mathbf{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top\mathbf{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right)

    We use the notation \mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{\Sigma}).

    Let’s deconstruct this formula –

    • Normalization Constant: The term \frac{1}{(2\pi)^{D/2}|\mathbf{\Sigma}|^{1/2}}
The term |\mathbf{\Sigma}| is the determinant of the covariance matrix; it equals the product of the eigenvalues of \mathbf{\Sigma}, so |\mathbf{\Sigma}|^{1/2} measures the volume (spread) of the distribution. A larger spread means a larger normalizing factor and hence a lower peak density.
    • The Exponential Argument: The term inside the exponent, \Delta^2 = (\mathbf{x} - \boldsymbol{\mu})^\top\mathbf{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})
      This quadratic form is known as the Mahalanobis distance squared. It measures the distance from a point \mathbf{x} to the mean \boldsymbol{\mu}, taking into account the covariance of the data. It is a unitless distance that accounts for the fact that the data cloud may be stretched and rotated.

    The Geometry of the MVN

    The loci of points with constant probability density (the isocontours) are the sets of \mathbf{x} for which the Mahalanobis distance is constant:

    (\mathbf{x} - \boldsymbol{\mu})^\top\mathbf{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}) = \text{const}

    This is the equation of a hyperellipse centered at \boldsymbol{\mu}. The shape and orientation of this hyperellipse are completely determined by the covariance matrix \mathbf{\Sigma}.

Now, for understanding the geometry of the ellipse, do an eigendecomposition –

    \mathbf{\Sigma} = \mathbf{P}\mathbf{D}\mathbf{P}^\top = \sum_{i=1}^D \lambda_i \mathbf{p}_i \mathbf{p}_i^\top

where \mathbf{P} is an orthogonal matrix whose columns \{\mathbf{p}_i\} are the eigenvectors of \mathbf{\Sigma}, and \mathbf{D} is a diagonal matrix of the corresponding non-negative eigenvalues \{\lambda_i\}.

    • Principal Axes: The eigenvectors \mathbf{p}_i of \mathbf{\Sigma} define the directions of the principal axes of the hyperellipse.
    • Axis Lengths: The eigenvalues \lambda_i determine the spread along these axes. The length of the semi-axis along the direction \mathbf{p}_i is proportional to the square root of the corresponding eigenvalue, \sqrt{\lambda_i}.

This means that the MVN is a cloud of points whose main axes of variation are aligned with the eigenvectors of its covariance matrix.
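A quick NumPy sketch (the particular \mathbf{\Sigma} is a toy choice of mine): eigendecompose a covariance matrix to read off the axes, then use \mathbf{x} = \boldsymbol{\mu} + \mathbf{P}\mathbf{D}^{1/2}\mathbf{z} with \mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) to produce a cloud with exactly that geometry:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[3.0, 1.2],
                  [1.2, 1.0]])

lam, P = np.linalg.eigh(Sigma)       # Sigma = P diag(lam) P^T
print(P)                             # columns = principal axes of the isocontour ellipses
print(np.sqrt(lam))                  # semi-axis lengths are proportional to these

A = P * np.sqrt(lam)                 # P @ diag(sqrt(lam)), i.e. P D^{1/2}
z = rng.standard_normal((100_000, 2))
x = mu + z @ A.T                     # affine transform of white noise
print(np.cov(x, rowvar=False))       # ~ Sigma, as expected
```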

    Partitioning the vector

    We begin with a joint Gaussian distribution over a high-dimensional random vector \mathbf{x} \in \mathbb{R}^D. Let’s partition this vector into two disjoint subsets, \mathbf{x}_a and \mathbf{x}_b.

\mathbf{x} = \begin{bmatrix} \mathbf{x}_a \\ \mathbf{x}_b \end{bmatrix}

    The joint distribution is defined by a partitioned mean vector and a partitioned covariance matrix:

    p(\mathbf{x}_a, \mathbf{x}_b) = \mathcal{N}\left( \begin{bmatrix} \mathbf{x}_a \\ \mathbf{x}_b \end{bmatrix} \Bigg| \begin{bmatrix} \boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b \end{bmatrix}, \begin{bmatrix} \mathbf{\Sigma}_{aa} & \mathbf{\Sigma}_{ab} \\ \mathbf{\Sigma}_{ba} & \mathbf{\Sigma}_{bb} \end{bmatrix} \right)

    where \boldsymbol{\mu}_a, \boldsymbol{\mu}_b are the mean vectors, \mathbf{\Sigma}_{aa}, \mathbf{\Sigma}_{bb} are the covariance matrices of \mathbf{x}_a and \mathbf{x}_b respectively, and \mathbf{\Sigma}_{ab} = \mathbf{\Sigma}_{ba}^\top is the cross-covariance matrix.

    We can now ask two questions

    1. Marginalization: What is the distribution of one subset, p(\mathbf{x}_a), if we have no information about the other?
    2. Conditioning: What is the distribution of one subset, p(\mathbf{x}_a | \mathbf{x}_b), if we observe the value of the other?

    Marginals and Conditions

    Marginals

    To find the marginal distribution p(\mathbf{x}_a), we must integrate out (or “marginalize”) the other variable, \mathbf{x}_b, from the joint distribution:

    p(\mathbf{x}_a) = \int p(\mathbf{x}_a, \mathbf{x}_b) d\mathbf{x}_b

    The marginal distribution of a joint Gaussian is also a Gaussian.
    The result of this integration is remarkably simple. We effectively just “read off” the corresponding parts from the joint distribution’s parameters:

    p(\mathbf{x}_a) = \mathcal{N}(\mathbf{x}_a | \boldsymbol{\mu}_a, \mathbf{\Sigma}_{aa})

    Intuition: If you have a 2D Gaussian “mountain,” its shadow projected onto the x-axis (integrating out y) is a 1D Gaussian bell curve.

    Conditionals

    Remember that p(\mathbf{x}_a | \mathbf{x}_b) = p(\mathbf{x}_a, \mathbf{x}_b) / p(\mathbf{x}_b).

    The conditional distribution of a joint Gaussian is also a Gaussian.

    The derivation involves algebraic manipulation of the exponents of the Gaussian PDFs (completing the square).

    Without going too much into the maths, I’ll offer the following results –

    • Conditional Mean:

    \boldsymbol{\mu}_{a|b} = \boldsymbol{\mu}_a + \mathbf{\Sigma}_{ab}\mathbf{\Sigma}_{bb}^{-1}(\mathbf{x}_b - \boldsymbol{\mu}_b)

    • Conditional Covariance:

    \mathbf{\Sigma}_{a|b} = \mathbf{\Sigma}_{aa} - \mathbf{\Sigma}_{ab}\mathbf{\Sigma}_{bb}^{-1}\mathbf{\Sigma}_{ba}

    Interpretation:

    • The conditional mean is a linear function of the observed variable \mathbf{x}_b. It starts at the prior mean \boldsymbol{\mu}_a and is adjusted based on how surprising the observation \mathbf{x}_b is (i.e., its deviation from its own mean, \mathbf{x}_b - \boldsymbol{\mu}_b). The adjustment is scaled by the correlation term \mathbf{\Sigma}_{ab}\mathbf{\Sigma}_{bb}^{-1}.
    • The conditional covariance is independent of the observation \mathbf{x}_b. Observing data reduces our uncertainty. The new covariance is the prior covariance \mathbf{\Sigma}_{aa} minus a positive semi-definite term that represents the information gained from the observation.
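A small sketch of these two formulas (the function name, the index-based partitioning, and the 3-D toy numbers are mine):

```python
import numpy as np

def gaussian_condition(mu, Sigma, idx_a, idx_b, x_b):
    """Mean and covariance of p(x_a | x_b) for x ~ N(mu, Sigma)."""
    mu_a, mu_b = mu[idx_a], mu[idx_b]
    S_aa = Sigma[np.ix_(idx_a, idx_a)]
    S_ab = Sigma[np.ix_(idx_a, idx_b)]
    S_bb = Sigma[np.ix_(idx_b, idx_b)]
    gain = S_ab @ np.linalg.inv(S_bb)          # Sigma_ab Sigma_bb^{-1}
    mu_cond = mu_a + gain @ (x_b - mu_b)       # conditional mean
    Sigma_cond = S_aa - gain @ S_ab.T          # conditional covariance
    return mu_cond, Sigma_cond

# Toy 3-D example: condition the first two components on the third being 0.5
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
print(gaussian_condition(mu, Sigma, [0, 1], [2], np.array([0.5])))
```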

    Product of Gaussian Densities

    When applying Bayes’ theorem, we often need to multiply a Gaussian likelihood by a Gaussian prior. This product is also related to the Gaussian family.

    Bayesian Inference Context

    In a Bayesian setting, the posterior is proportional to the likelihood times the prior:

    p(\boldsymbol{\theta} | \mathcal{D}) \propto p(\mathcal{D} | \boldsymbol{\theta}) p(\boldsymbol{\theta})

    Yeah, this is just a fancy term.

    What this is

    The product of two Gaussian PDFs is an unnormalized Gaussian PDF.
    Let’s consider two Gaussian densities (not necessarily normalized distributions) over the same variable \mathbf{x}:

    \mathcal{N}(\mathbf{x} | \boldsymbol{\mu}_a, \mathbf{A}) \quad \text{and} \quad \mathcal{N}(\mathbf{x} | \boldsymbol{\mu}_b, \mathbf{B})

    Their product is:

    \mathcal{N}(\mathbf{x} | \boldsymbol{\mu}_a, \mathbf{A}) \mathcal{N}(\mathbf{x} | \boldsymbol{\mu}_b, \mathbf{B}) = c \cdot \mathcal{N}(\mathbf{x} | \boldsymbol{\mu}_c, \mathbf{C})

    where the new covariance \mathbf{C} and mean \boldsymbol{\mu}_c of the resulting Gaussian are given by:

    • New Covariance: \mathbf{C} = (\mathbf{A}^{-1} + \mathbf{B}^{-1})^{-1}
    • New Mean: \boldsymbol{\mu}_c = \mathbf{C}(\mathbf{A}^{-1}\boldsymbol{\mu}_a + \mathbf{B}^{-1}\boldsymbol{\mu}_b)
    • The term c is a scaling constant that does not depend on \mathbf{x}. It is the value needed to normalize the new Gaussian.

    Interpretation in terms of Information: The inverse of a covariance matrix is known as the precision matrix. The equations show that when multiplying Gaussians, we simply add their precisions. This aligns with the intuition that combining two sources of information results in a more precise (higher precision, lower variance) final belief. The new mean is a precision-weighted average of the original means.
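In 1-D the formulas collapse to something you can eyeball (the numbers below are my own toy example):

```python
def fuse(mu_a, var_a, mu_b, var_b):
    """Mean and variance of the renormalized product of two 1-D Gaussian densities."""
    var_c = 1.0 / (1.0 / var_a + 1.0 / var_b)        # precisions add
    mu_c = var_c * (mu_a / var_a + mu_b / var_b)     # precision-weighted mean
    return mu_c, var_c

# A vague belief N(0, 4) fused with a sharp one N(10, 1):
print(fuse(0.0, 4.0, 10.0, 1.0))   # (8.0, 0.8) -- pulled toward the more precise source
```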

    Sums and Linear Transformations of Gaussians

    Setting

    We apply a linear (or more generally, affine) transformation to \mathbf{x} to produce a new random variable \mathbf{y} \in \mathbb{R}^K:

    \mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b}

    where \mathbf{A} \in \mathbb{R}^{K \times D} and \mathbf{b} \in \mathbb{R}^K.

    We want to know

    What is the distribution of \mathbf{y}?

    Vineeth tells you what’s up

    An affine transformation of a Gaussian random variable is also a Gaussian random variable.

    Computing the New Mean: We use the linearity of expectation:

    \mathbb{E}[\mathbf{y}] = \mathbb{E}[\mathbf{A}\mathbf{x} + \mathbf{b}] = \mathbf{A}\mathbb{E}[\mathbf{x}] + \mathbf{b} = \mathbf{A}\boldsymbol{\mu} + \mathbf{b}

    Computing the New Covariance: We use the property of covariance under affine transformations:

    \text{Cov}(\mathbf{y}) = \text{Cov}(\mathbf{A}\mathbf{x} + \mathbf{b}) = \mathbf{A}\text{Cov}(\mathbf{x})\mathbf{A}^\top = \mathbf{A}\mathbf{\Sigma}\mathbf{A}^\top

    Therefore, the distribution of the transformed variable \mathbf{y} is:

    p(\mathbf{y}) = \mathcal{N}(\mathbf{y} | \mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{A}\mathbf{\Sigma}\mathbf{A}^\top)

    Special Case: Sum of Independent Gaussians

    If we have two independent Gaussian random variables, X \sim \mathcal{N}(\mu_X, \sigma_X^2) and Y \sim \mathcal{N}(\mu_Y, \sigma_Y^2), their sum Z=X+Y is also a Gaussian. This is a special case of the linear transformation where \mathbf{x} = [X, Y]^\top, \mathbf{A}=[1, 1], and \mathbf{b}=0. The resulting distribution is:

    Z \sim \mathcal{N}(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2)

    The means add, and because of independence, the variances add.

    Sources

    1. Mathematics for Machine Learning (link)

    1. What’s there?
      1. Why do I write this?
    2. Bernoulli and Binomial Distributions
      1. Why together?
      2. Foundation – The Single Trial
      3. Foundation – Sum of Trials
    3. Poisson Distribution
      1. Huh?
      2. Properties
      3. When is this even used? Predicting counts.
    4. Categorical and Multinomial Distributions
      1. Foundation – The Single Multi-Class Trial
      2. Foundation – Sum of Multi-Class Trials
    5. Sources

    What’s there?

    Like Foundations of Probability, I wanted to cover some more basics before Gaussians really quickly.

    Why do I write this?

    I wanted to create an authoritative document I could refer to for later. Thought that I might as well make it public (build in public, serve society and that sort of stuff).

    Bernoulli and Binomial Distributions

    Why together?

    They form the basis for modeling any experiment with a binary outcome.

    Foundation – The Single Trial

    Bernoulli Trial

    A single experiment with exactly two possible, mutually exclusive outcomes. These outcomes are generically labeled “success” and “failure”.

    Bernoulli Distribution

    A random variable X is said to follow a Bernoulli distribution if it is the outcome of a single Bernoulli trial.

    The distribution is governed by a single parameter, p \in [0, 1], which represents the probability of success.

    Probability Mass Function (PMF): The PMF of a Bernoulli random variable is

    P(X=x | p) = \begin{cases} p & \text{if } x=1 \\ 1-p & \text{if } x=0 \end{cases}

    This can be written more compactly as

p(x|p) = p^x (1-p)^{1-x}, \quad x \in \{0, 1\}

    Moments:

    • Expectation: \mathbb{E}[X] = 1 \cdot p + 0 \cdot (1-p) = p
    • Variance: \text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = (1^2 \cdot p + 0^2 \cdot (1-p)) - p^2 = p - p^2 = p(1-p)

    Foundation – Sum of Trials

    Binomial Experiment

    The Binomial distribution arises from extending the Bernoulli trial. It describes the outcome of an experiment that satisfies the following four conditions:

    1. The experiment consists of a fixed number of trials, n.
    2. Each trial is independent of the others.
    3. Each trial has only two possible outcomes (success/failure).
    4. The probability of success, p, is the same for each trial.

    Binomial Distribution

    A random variable K is said to follow a Binomial distribution if it represents the total number of successes in n independent Bernoulli trials. It is governed by two parameters: the number of trials n \in \mathbb{N} and the probability of success p \in [0, 1]. We write K \sim \text{Bin}(n, p).

    Probability Mass Function (PMF): To find the probability of observing exactly k successes in n trials, we need two components:

    1. The probability of any specific sequence of k successes and n-k failures is p^k(1-p)^{n-k}, due to independence.
    2. The number of distinct ways to arrange k successes among n trials is given by the binomial coefficient: \binom{n}{k} = \frac{n!}{k!(n-k)!}. Combining these gives the PMF:

P(K=k | n, p) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k \in \{0, 1, \dots, n\}

    Moments: The Binomial random variable K can be seen as the sum of n independent Bernoulli random variables, K = \sum_{i=1}^n X_i where X_i \sim \text{Bernoulli}(p). Using the linearity of expectation and the property that the variance of a sum of independent variables is the sum of their variances:

    • Expectation: \mathbb{E}[K] = \mathbb{E}[\sum X_i] = \sum \mathbb{E}[X_i] = \sum p = np
    • Variance: \text{Var}(K) = \text{Var}[\sum X_i] = \sum \text{Var}[X_i] = \sum p(1-p) = np(1-p)

    Poisson Distribution

    The Poisson distribution can be derived as the limit of the Binomial distribution, \text{Bin}(n, p), as the number of trials n goes to infinity while the probability of success p goes to zero, in such a way that their product remains constant: np = \lambda.

    Huh?

    Imagine dividing a one-hour interval into n=3600 one-second subintervals. Let the “success” be an event (e.g., a customer arriving) occurring in a subinterval. The probability p of success in any one-second subinterval is very small, but the number of trials n is very large. In this limit, the Binomial PMF converges to the Poisson PMF.
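You can watch the convergence happen numerically (\lambda and k below are arbitrary picks of mine):

```python
from math import comb, exp, factorial

lam, k = 3.0, 4                          # fixed mean np = lam; look at P(K = k)
for n in (10, 100, 1_000, 10_000):
    p = lam / n
    print(n, comb(n, k) * p**k * (1 - p)**(n - k))    # Binomial PMF, creeping toward...
print("Poisson:", lam**k * exp(-lam) / factorial(k))  # ...the Poisson PMF
```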

    Properties

    Probability Mass Function (PMF):

P(K=k | \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k \in \{0, 1, 2, \dots\}

    The range of a Poisson variable is countably infinite. The term e^{-\lambda} is a normalization constant that ensures the probabilities sum to 1.

    Moments:

    • Expectation: \mathbb{E}[K] = \lambda
    • Variance: \text{Var}(K) = \lambda

    When is this even used? Predicting counts.

    What if we want to predict the count of things? That is,

    When the target variable in a regression problem is a count (e.g., number of purchases a customer makes in a month, number of clicks on an ad), standard linear regression is inappropriate.

    Why?

    A linear regression model doesn’t know the target variable is a count. It can easily predict negative values (e.g., -2 purchases) or fractional values (e.g., 4.7 clicks), which are nonsensical in this context.

    Moreover, and this is important,

    Linear regression assumes that the variability (variance) of the data points is the same regardless of their predicted value. Huh?

The property at play here is called heteroscedasticity, where the variability of a variable is unequal across the range of values of a second variable that predicts it.

    In simpler terms, the data’s “spread” or “scatter” is not constant!

    Standard linear regression assumes homoscedasticity. This means it expects the errors (the difference between the actual data points and the regression line) to have the same variance at all levels of the predictor variable.

    Think of it as a consistent “error band” around the regression line. The amount of scatter should be roughly the same for small predicted values as it is for large ones.

    Count data is often heteroscedastic. The variance tends to increase as the mean count increases. For instance, consider the example of daily website visitors –

    • Small Personal Blog: The average number of visitors might be 10 per day. The daily count probably won’t vary much, maybe staying within a tight range like 5 to 15. The variance is low.
    • Popular Website: The average might be 10,000 visitors per day. On a slow day, it might get 8,000, and on a viral day, it might get 15,000. The range of possible values is much wider, so the variance is high.

    What do we do instead?

    Poisson regression is used instead, where the response variable is assumed to follow a Poisson distribution, and the logarithm of its mean parameter \lambda is modeled as a linear function of the features.

    1. This distribution is defined only for non-negative integers, so the model’s underlying assumption matches the nature of the data.
    2. Also, a key property of the Poisson distribution is that its mean is equal to its variance, which naturally handles the issue of non-constant variance.

    How does it work?

    Instead of modeling the mean count directly, Poisson regression models the logarithm of the mean as a linear combination of the features.

    \log(\lambda) = \beta_0 + \beta_1x_1 + \dots + \beta_nx_n

    Using the log ensures that the predicted mean, \lambda = \exp(\beta_0 + \beta_1x_1 + \dots), will always be positive, thus preventing the model from making impossible negative predictions. This logarithmic transformation is called a link function.
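A minimal sketch of the idea, fitting the log-link model by gradient ascent on the Poisson log-likelihood (the simulated data, learning rate, and iteration count are all my own toy choices; in practice you'd reach for a ready-made GLM routine such as statsmodels' Poisson family):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated count data: log(lambda) = 0.5 + 0.8*x
n = 5_000
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, n)])   # intercept + one feature
beta_true = np.array([0.5, 0.8])
y = rng.poisson(np.exp(X @ beta_true))

# Gradient of the Poisson log-likelihood with a log link: X^T (y - exp(X beta))
beta = np.zeros(2)
for _ in range(2_000):
    mu = np.exp(X @ beta)
    beta += 1e-4 * X.T @ (y - mu)
print(beta)     # ~ [0.5, 0.8]; predictions exp(X @ beta) are always positive
```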

    Categorical and Multinomial Distributions

    These are the direct generalizations of the Bernoulli and Binomial distributions to experiments with more than two possible outcomes.

    Foundation – The Single Multi-Class Trial

    Categorical Trial:

    This is a single experiment with K possible, mutually exclusive outcomes (or “categories”).

    Categorical Distribution:

    A random variable X follows a Categorical distribution if it is the outcome of a single such trial.

    The distribution is governed by a parameter vector \mathbf{p} = [p_1, \dots, p_K]^\top, where p_k is the probability of the k-th outcome, and \sum_{k=1}^K p_k = 1.

    • Representation: The outcome is typically represented using one-hot encoding. A trial resulting in category k is represented by a K-dimensional vector \mathbf{x} = [0, \dots, 1, \dots, 0]^\top, where the 1 is in the k-th position.
    • Probability Mass Function (PMF): For a one-hot encoded vector \mathbf{x}:

    p(\mathbf{x}|\mathbf{p}) = \prod_{k=1}^K p_k^{x_k}

    Foundation – Sum of Multi-Class Trials

    Multinomial Experiment

    This is an experiment consisting of n independent Categorical trials, each with the same probability vector \mathbf{p}.

    Multinomial Distribution

    A random vector \mathbf{X} = [X_1, \dots, X_K]^\top follows a Multinomial distribution if each component X_k represents the total count for the k-th category in n trials. It is governed by parameters n and \mathbf{p}. We write \mathbf{X} \sim \text{Multinomial}(n, \mathbf{p}).

    Probability Mass Function (PMF): The probability of observing exactly x_1 outcomes of category 1, x_2 of category 2, …, up to x_K of category K (where \sum x_k = n) is given by:

    p(x_1, \dots, x_K | n, \mathbf{p}) = \frac{n!}{x_1! x_2! \cdots x_K!} \prod_{k=1}^K p_k^{x_k}

    The first term is the multinomial coefficient, which counts the number of ways to arrange the outcomes.

    Moments:

    • Expectation: \mathbb{E}[X_k] = np_k
    • Variance: \text{Var}(X_k) = np_k(1-p_k) (Each marginal is a Binomial)
    • Covariance: \text{Cov}(X_i, X_j) = -np_i p_j for i \neq j. The counts are negatively correlated because an increase in the count of one category must come at the expense of another.
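All three moments are easy to verify by simulation (n and \mathbf{p} below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, np.array([0.2, 0.3, 0.5])

draws = rng.multinomial(n, p, size=200_000)
print(draws.mean(axis=0))             # ~ n*p = [2, 3, 5]
print(np.cov(draws, rowvar=False))    # diag ~ n*p*(1-p); off-diag ~ -n*p_i*p_j
```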

    Sources

    1. Mathematics for Machine Learning (link)