- What’s there?
- Maximum Likelihood Estimation
- Maximum a Posteriori (MAP) Estimation
- Full Bayesian Estimation
- Summary
- Sources
What’s there?
As an introduction to ML, I need to understand estimation theory – MLE, MAP and Bayesian Estimation.
Why do I write this?
I wanted to create an authoritative document I could refer back to later. Thought I might as well make it public (build in public, serve society and that sort of stuff).
Maximum Likelihood Estimation
We have briefly covered this earlier as well, in PCA, Spectral Clustering and t-SNE.
Foundational principle
This answers
“What set of parameter values makes the observed data most probable?”
Likelihood Function: We begin with a parameterized probability model $p(\mathcal{D} \mid \boldsymbol{\theta})$, where $\mathcal{D}$ is the observed dataset and $\boldsymbol{\theta}$ is the vector of model parameters. The likelihood function, $L(\boldsymbol{\theta})$, is defined as this probability, but viewed as a function of the parameters $\boldsymbol{\theta}$, holding the data $\mathcal{D}$ fixed:

$$L(\boldsymbol{\theta}) = p(\mathcal{D} \mid \boldsymbol{\theta})$$

The principle of maximum likelihood states that the optimal estimate for the parameters, $\hat{\boldsymbol{\theta}}_{\text{MLE}}$, is the value that maximizes the likelihood function:

$$\hat{\boldsymbol{\theta}}_{\text{MLE}} = \arg\max_{\boldsymbol{\theta}} L(\boldsymbol{\theta})$$
How is it usually done?
For computational stability and mathematical convenience, we almost always work with the log-likelihood. Since the logarithm is a strictly monotonically increasing function, maximizing the likelihood is equivalent to maximizing the log-likelihood. For an i.i.d. dataset, the likelihood is a product, which the logarithm turns into a sum:

$$\log L(\boldsymbol{\theta}) = \log \prod_{i=1}^{N} p(x_i \mid \boldsymbol{\theta}) = \sum_{i=1}^{N} \log p(x_i \mid \boldsymbol{\theta})$$
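To make this concrete, here is a minimal sketch (my own toy example, not from the book) of how MLE is usually done in practice: minimize the negative log-likelihood numerically. The Gaussian model, the data, and the variable names are all made up for illustration.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Toy i.i.d. dataset (made up for illustration)
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100)

def neg_log_likelihood(params, x):
    """Negative log-likelihood of a Gaussian: -sum_i log p(x_i | mu, sigma)."""
    mu, log_sigma = params          # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))

# MLE = argmin of the negative log-likelihood
result = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(data,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

print(mu_hat, sigma_hat)            # numerical MLE
print(data.mean(), data.std())      # should match the closed-form Gaussian MLE
```

For the Gaussian this has a closed form (sample mean and standard deviation), but the same minimize-the-negative-log-likelihood recipe carries over to models without one.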
Overfitting
MLE’s primary drawback is its tendency to overfit, especially with small datasets. It only considers the data it has seen, with no mechanism for regularization or expressing prior beliefs.
So, say, we flip a coin three times and get only heads; this is our dataset. The MLE for the head probability is then $\hat{p} = 3/3 = 1$, so our fitted model will always predict heads.
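As a tiny sketch of that (toy numbers, assuming a Bernoulli model for the coin):

```python
import numpy as np

flips = np.array([1, 1, 1])   # three flips, all heads (1 = heads, 0 = tails)

# Bernoulli MLE: the fraction of heads in the observed data
p_heads_mle = flips.mean()

print(p_heads_mle)            # 1.0 -> the model predicts heads every time
print(1 - p_heads_mle)        # 0.0 -> tails is assigned zero probability
```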
Maximum a Posteriori (MAP) Estimation
MAP estimation extends MLE by incorporating prior beliefs about the parameters.
Foundational principles
- The Bayesian Framework: Unlike the frequentist view, the Bayesian perspective treats the model parameters $\boldsymbol{\theta}$ not as fixed constants, but as random variables about which we can have beliefs.
- The Prior Distribution ($p(\boldsymbol{\theta})$): We encode our initial beliefs about the parameters before seeing any data in a prior probability distribution, $p(\boldsymbol{\theta})$. This allows us to specify which parameter values are more or less plausible.
  - For instance, a Gaussian prior centered at zero, $p(\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{\theta} \mid \mathbf{0}, \sigma^2 I)$, expresses a belief that smaller parameter values are more likely.
- The Posterior Distribution ($p(\boldsymbol{\theta} \mid \mathcal{D})$): After observing data $\mathcal{D}$, we update our beliefs using Bayes’ theorem to obtain the posterior distribution:
$$p(\boldsymbol{\theta} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta})}{p(\mathcal{D})}$$
- The MAP Principle: MAP estimation does not use the full posterior distribution. Instead, it seeks the single “best” point estimate for $\boldsymbol{\theta}$ by finding the mode of the posterior distribution. This is the point of highest probability density (see the coin-flip sketch right after this list):
$$\hat{\boldsymbol{\theta}}_{\text{MAP}} = \arg\max_{\boldsymbol{\theta}} p(\boldsymbol{\theta} \mid \mathcal{D}) = \arg\max_{\boldsymbol{\theta}} p(\mathcal{D} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta})$$
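Continuing the coin example, here is a minimal sketch (my own; the Beta(2, 2) prior is an arbitrary choice for illustration, picked because a Beta prior is conjugate to the Bernoulli likelihood) of taking the mode of the posterior instead of the likelihood:

```python
import numpy as np

flips = np.array([1, 1, 1])                  # the same three heads as before
heads, tails = flips.sum(), len(flips) - flips.sum()

# Beta(2, 2) prior on the head probability: gently favours values near 0.5
a_prior, b_prior = 2.0, 2.0

# With a Beta prior and a Bernoulli likelihood, the posterior is Beta(a + heads, b + tails)
a_post, b_post = a_prior + heads, b_prior + tails

# MAP estimate = mode of the Beta posterior (valid when both parameters exceed 1)
p_heads_map = (a_post - 1) / (a_post + b_post - 2)

print(p_heads_map)                           # 0.8 -- pulled away from the MLE of 1.0 by the prior
```

The prior keeps three lucky flips from convincing the model that tails is impossible, which is exactly the regularizing effect described next.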
Connection to regularization
In linear regression, if we place a Gaussian prior on the parameters, $p(\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{\theta} \mid \mathbf{0}, \sigma^2 I)$, the log-prior is $\log p(\boldsymbol{\theta}) = -\frac{1}{2\sigma^2} \lVert \boldsymbol{\theta} \rVert^2 + \text{const}$. Maximizing the log-posterior is equivalent to minimizing the negative log-posterior:

$$\hat{\boldsymbol{\theta}}_{\text{MAP}} = \arg\min_{\boldsymbol{\theta}} \left[ -\log p(\mathcal{D} \mid \boldsymbol{\theta}) + \frac{1}{2\sigma^2} \lVert \boldsymbol{\theta} \rVert^2 \right]$$

With a Gaussian likelihood, the negative log-likelihood is a sum of squared errors, so this is the objective function for Ridge Regression (L2 regularization).
MAP estimation with a Gaussian prior is equivalent to L2 regularization.
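A small sketch of that equivalence (toy data and variable names are my own): solving the MAP objective for linear regression with a zero-mean Gaussian prior gives the same weights as the closed-form ridge solution with regularization strength $\lambda = \sigma_{\text{noise}}^2 / \sigma_{\text{prior}}^2$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear-regression data: y = X @ w_true + Gaussian noise
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])
sigma_noise = 0.5
y = X @ w_true + rng.normal(scale=sigma_noise, size=n)

sigma_prior = 1.0                                  # std of the zero-mean Gaussian prior on w

# Ridge regression closed form: w = (X^T X + lam * I)^-1 X^T y
lam = sigma_noise**2 / sigma_prior**2              # regularization strength implied by the prior
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# MAP estimate: minimize the negative log-posterior
#   (1 / (2 * sigma_noise^2)) * ||y - X w||^2 + (1 / (2 * sigma_prior^2)) * ||w||^2
# Setting the gradient to zero gives the same linear system, just scaled by 1 / sigma_noise^2.
w_map = np.linalg.solve(
    X.T @ X / sigma_noise**2 + np.eye(d) / sigma_prior**2,
    X.T @ y / sigma_noise**2,
)

print(np.allclose(w_ridge, w_map))                 # True: MAP with a Gaussian prior == ridge
```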
Is it perfect?
Nope.
MAP still only provides a point estimate ($\hat{\boldsymbol{\theta}}_{\text{MAP}}$). It finds the most probable parameters but discards all other information contained in the posterior distribution, such as its variance or shape.
It doesn’t quantify our uncertainty about the parameters.
Full Bayesian Estimation
“Full Bayesian estimation embraces the entire philosophy of the Bayesian framework. It does not seek a single best point estimate for the parameters; instead, its goal is to compute and use the entire posterior distribution.”
See? So good. Literally standing on the shoulders of giants all the time.
Basics
The primary goal is not parameter estimation itself, but making predictions.
Bayesian estimation propagates the full parameter uncertainty through to the predictions. The output of the learning process is not a single value $\hat{\boldsymbol{\theta}}$, but the full probability distribution $p(\boldsymbol{\theta} \mid \mathcal{D})$.
Theoretical development
The Predictive Distribution: To make a prediction for a new data point $x^*$, we do not use a single point estimate. Instead, we compute the posterior predictive distribution by averaging the predictions of all possible parameter values, weighted by their posterior probability:

$$p(x^* \mid \mathcal{D}) = \int p(x^* \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta} \mid \mathcal{D}) \, d\boldsymbol{\theta}$$

This can be written as an expectation:

$$p(x^* \mid \mathcal{D}) = \mathbb{E}_{p(\boldsymbol{\theta} \mid \mathcal{D})}\!\left[ p(x^* \mid \boldsymbol{\theta}) \right]$$

The Procedure: The core task of full Bayesian estimation is integration, not optimization. We must solve the integral above, as well as the integral in the denominator of Bayes’ theorem (the marginal likelihood or evidence), $p(\mathcal{D}) = \int p(\mathcal{D} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta}) \, d\boldsymbol{\theta}$.
Summary
| Method | MLE | MAP | Full Bayesian |
| --- | --- | --- | --- |
| Core Principle | Maximize the likelihood | Maximize the posterior | Use the entire posterior |
| View of Parameters | Fixed, unknown constant | Random variable (but we only find its mode) | Random variable (we use its full distribution) |
| Core Task | Optimization | Optimization (regularized) | Integration |
| Output of “Learning” | A point estimate, $\hat{\boldsymbol{\theta}}_{\text{MLE}}$ | A point estimate, $\hat{\boldsymbol{\theta}}_{\text{MAP}}$ | A probability distribution, $p(\boldsymbol{\theta} \mid \mathcal{D})$ |
| Handles Overfitting? | No (prone to overfitting) | Yes (via the prior, which acts as a regularizer) | Yes (averaging over all parameters naturally avoids overfitting) |
| Quantifies Uncertainty? | No | No | Yes (the primary advantage) |
Sources
- Mathematics for Machine Learning (link)