What’s there?
I wanted to note down Type I and II errors, p-values and confidence intervals.
The central challenge in science and data analysis is to distinguish a real effect from random chance.
I wish to understand this.
Why do I write this?
I wanted to create an authoritative document I could refer back to later. Thought that I might as well make it public (build in public, serve society and that sort of stuff).
Errors
Structure of a hypothesis test
- The Null Hypothesis ($H_0$): This is the default assumption, the hypothesis of “no effect” or “no difference.” It represents the status quo. For example, “$H_0$: The new drug has no effect on blood pressure.” The entire framework is designed to assess the strength of evidence against this hypothesis.
- The Alternative Hypothesis ($H_1$ or $H_a$): This is the claim we wish to investigate. It represents a real effect or a true difference. For example, “$H_1$: The new drug lowers blood pressure.”
Based on the data we collect, we must make a decision: either reject the null hypothesis (in favor of the alternative) or fail to reject the null hypothesis.
The errors themselves
Given the true state of the world (which is unknown to us) and the decision we make, there are four possible outcomes, two of which are errors:
| | Decision: Fail to Reject $H_0$ | Decision: Reject $H_0$ |
|---|---|---|
| Truth: $H_0$ is true | Correct Decision | Type I Error (False Positive) |
| Truth: $H_0$ is false | Type II Error (False Negative) | Correct Decision (True Positive) |
- Type I Error (False Positive):
  - Definition: A Type I error occurs when we reject the null hypothesis when it is actually true.
  - Probability: The probability of committing a Type I error is denoted by $\alpha$. This value is called the significance level of the test.
  - We control for this error by choosing a small value for $\alpha$ before conducting the experiment (commonly $\alpha = 0.05$ or $\alpha = 0.01$). This means we are willing to accept a 5% or 1% chance of making a false positive conclusion.
- Type II Error (False Negative):
  - Definition: A Type II error occurs when we fail to reject the null hypothesis when it is actually false.
  - Probability: The probability of committing a Type II error is denoted by $\beta$. The value $1 - \beta$ is called the statistical power of the test. It is the probability of correctly detecting a real effect when it exists.
  - The value of $\beta$ is not fixed in advance but depends on the true size of the effect, the sample size, and the chosen significance level $\alpha$.
  - There is an inherent tradeoff between $\alpha$ and $\beta$; decreasing the chance of a Type I error generally increases the chance of a Type II error. This makes sense intuitively, right? A lower $\alpha$ (a stricter test) makes it harder to reject the null hypothesis, thus increasing $\beta$.
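To make the tradeoff concrete, here is a minimal Monte Carlo sketch. The setup is assumed purely for illustration (normal data, 30 observations per group, an effect of 0.5 standard deviations); the point is only that tightening $\alpha$ lowers power.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, n_sims = 30, 5_000  # observations per group, simulated experiments

def rejection_rate(effect, alpha):
    """Fraction of simulated two-sample t-tests that reject H0 at level alpha."""
    rejections = 0
    for _ in range(n_sims):
        control = rng.normal(loc=0.0, scale=1.0, size=n)
        treatment = rng.normal(loc=effect, scale=1.0, size=n)
        _, p = stats.ttest_ind(treatment, control)
        rejections += p <= alpha
    return rejections / n_sims

# No real effect: the rejection rate estimates the Type I error rate (~alpha).
print("Type I error rate:", rejection_rate(effect=0.0, alpha=0.05))
# Real effect: the rejection rate estimates power, 1 - beta.
print("Power at alpha = 0.05:", rejection_rate(effect=0.5, alpha=0.05))
# Stricter alpha: fewer false positives, but also lower power (higher beta).
print("Power at alpha = 0.01:", rejection_rate(effect=0.5, alpha=0.01))
```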
p-values
First principles
- The Core Question: “If the null hypothesis were true, how surprising is my data?”
- The Test Statistic: We first compute a test statistic from our data (e.g., a t-statistic, a chi-squared statistic). This is a single number that summarizes the deviation of our data from what would be expected under the null hypothesis.
- The Null Distribution: We must know the probability distribution that this test statistic would follow if the null hypothesis were true. This is the null distribution.
Theory
Formal Definition (p-value): The p-value is the probability, assuming the null hypothesis ($H_0$) is true, of observing a test statistic that is at least as extreme as the one actually observed.
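In symbols, for a two-sided test with test statistic $T$ and observed value $t_{\text{obs}}$ (a minimal version; a one-sided test uses a single tail instead):

$$ p = P\big(|T| \ge |t_{\text{obs}}| \,\big|\, H_0 \text{ is true}\big) $$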
(Mis)interpretations
- What it IS: The p-value is a measure of surprise. A small p-value means that our observed data is very surprising if the null hypothesis is true. This leads us to doubt the null hypothesis.
- What it is NOT: The p-value is NOT the probability that the null hypothesis is true; it is not $P(H_0 \mid \text{data})$. This is the most common and critical misinterpretation. Calculating $P(H_0 \mid \text{data})$ requires a Bayesian approach involving a prior probability for $H_0$.
The Decision Rule: To make a decision, we compare the p-value to our pre-specified significance level $\alpha$.
- If $p \le \alpha$: We reject the null hypothesis. The result is deemed “statistically significant.” We conclude that there is strong evidence for the alternative hypothesis.
- If $p > \alpha$: We fail to reject the null hypothesis. The result is “not statistically significant.” We conclude that we do not have sufficient evidence to discard the null hypothesis (this is NOT the same as proving the null hypothesis is true).
With an example
You are a botanist (yay!). You have developed a new fertilizer. Does it actually make plants grow taller?
Your core question is: “If this fertilizer has no effect at all, how surprising is it that my fertilized plants grew so much taller than the unfertilized ones?”
To answer this, you set up an experiment:
- You take 60 seedlings.
- 30 seedlings get the new fertilizer (the treatment group).
- 30 seedlings get only water (the control group).
After a month, you measure all the plants. You find that the fertilized plants are, on average, 2 cm taller. To standardize this result, you calculate a t-statistic. This single number boils down the difference in means, the variation in the data, and the sample size. Let’s say you calculate a t-statistic of 2.5.
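For reference, a common form of this statistic, assuming equal group sizes $n$ and a pooled standard deviation $s_p$ (the exact formula your software uses may differ slightly), is:

$$ t = \frac{\bar{x}_{\text{treatment}} - \bar{x}_{\text{control}}}{s_p \sqrt{\tfrac{2}{n}}} $$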
Now, you need to know if t = 2.5 is a big deal. You turn to theory. Statisticians know that if the fertilizer had no effect (the null hypothesis is true), and you repeated this experiment many times, the t-statistics you’d get would form a specific bell-shaped curve called a t-distribution, centered at 0. This is your null distribution.
You check your null distribution to see how your result (t = 2.5) fits in. The p-value is the probability of getting a t-statistic that is at least as extreme as 2.5, assuming the fertilizer is useless. “Extreme” here means 2.5 or greater, OR -2.5 or less.
Let’s say the area under the tails of the curve for these extreme values is 0.012. So, your p-value = 0.012.
The p-value of 0.012 is a measure of surprise. It means that if the fertilizer had no effect, there would only be a 1.2% chance of seeing a height difference as large as the one you observed just by random luck. Because this chance is so small, you should be very surprised. This surprise leads you to strongly doubt your initial assumption that the fertilizer is useless.
This is the critical part. The p-value of 0.012 does NOT mean there’s a 1.2% chance that the null hypothesis is true (i.e., a 1.2% chance the fertilizer is useless). It’s a statement about your data’s rarity under the null hypothesis, not a statement about the hypothesis itself.
Before the experiment, you set a significance level, your threshold for surprise. Let’s use the standard $\alpha = 0.05$.
- Compare: Your p-value (0.012) is smaller than your alpha (0.05).
- Decision: Since $p < \alpha$, you reject the null hypothesis.
- Conclusion: The result is “statistically significant.” You have strong evidence to conclude that your new fertilizer does, in fact, make plants grow taller.
What if your p-value was 0.34?
In that case, since 0.34 > 0.05, you would fail to reject the null hypothesis. This doesn’t prove the fertilizer is useless. It just means this particular experiment didn’t provide strong enough evidence to convince you that it works.
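If you want to see these mechanics end to end, here is a minimal sketch in Python. The plant heights are simulated from made-up numbers, so the resulting t-statistic and p-value will not match the illustrative 2.5 and 0.012 above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical heights in cm: 30 fertilized seedlings vs. 30 watered-only controls.
control = rng.normal(loc=20.0, scale=3.0, size=30)
treatment = rng.normal(loc=22.0, scale=3.0, size=30)  # simulate a true +2 cm effect

# Two-sided two-sample t-test: H0 says the fertilizer has no effect on height.
t_stat, p_value = stats.ttest_ind(treatment, control)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

alpha = 0.05
if p_value <= alpha:
    print("Reject H0: this height difference would be surprising if the fertilizer were useless.")
else:
    print("Fail to reject H0: not enough evidence that the fertilizer works.")
```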
Confidence intervals
Fundamentals
- The Problem: A point estimate (like the sample mean $\bar{x}$) is our single best guess for an unknown population parameter (the true mean $\mu$). However, we know this estimate is almost certainly not exactly correct due to sampling variability. A confidence interval provides a range of plausible values for the true parameter.
- The Frequentist Perspective: This is a subtle but crucial point. From a frequentist perspective, the true population parameter $\theta$ is a fixed, unknown constant. It is not a random variable. The confidence interval, however, is random. Its endpoints are calculated from the random sample, so they would be different if we were to repeat the experiment.
Theory
Formal Definition (Confidence Interval): A $100(1-\alpha)\%$ confidence interval for a parameter $\theta$ is an interval $(L, U)$, computed from the sample, such that if we were to repeat the sampling process an infinite number of times, $100(1-\alpha)\%$ of the thus-constructed intervals would contain the true, fixed parameter $\theta$.
This probability statement is about the procedure of constructing the interval, not about the specific interval we have calculated.
Example
Construction (Example for a Mean): A confidence interval for a population mean is typically constructed as:

$$ \bar{x} \pm t_{\alpha/2,\, n-1} \cdot \frac{s}{\sqrt{n}} $$

where:
- $\bar{x}$ is the sample mean.
- $s$ is the sample standard deviation.
- $n$ is the sample size.
- $t_{\alpha/2,\, n-1}$ is the critical value from the Student’s t-distribution with $n-1$ degrees of freedom, chosen such that the area between $-t_{\alpha/2,\, n-1}$ and $t_{\alpha/2,\, n-1}$ is $1-\alpha$.
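Here is a minimal sketch of that construction in Python; the data are simulated from a made-up normal distribution, so only the mechanics matter.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.normal(loc=11.0, scale=2.0, size=25)  # hypothetical measurements

n = sample.size
x_bar = sample.mean()
s = sample.std(ddof=1)  # sample standard deviation
alpha = 0.05

# Critical value t_{alpha/2, n-1}: the area between -t_crit and +t_crit is 1 - alpha.
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
half_width = t_crit * s / np.sqrt(n)
print(f"95% CI: [{x_bar - half_width:.2f}, {x_bar + half_width:.2f}]")

# scipy's one-liner for the same interval:
print(stats.t.interval(0.95, df=n - 1, loc=x_bar, scale=stats.sem(sample)))
```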
(Mis)interpretations
- What it IS: If we calculate a 95% confidence interval for the mean to be [10.2, 11.8], the correct interpretation is: “We are 95% confident that the procedure we used to generate this interval captures the true population mean.” It is a statement about the reliability of our method.
- What it is NOT: It is NOT correct to say: “There is a 95% probability that the true mean $\mu$ lies within the interval [10.2, 11.8].” This is incorrect because in the frequentist paradigm, the true mean $\mu$ is a fixed constant, not a random variable; it is either in the interval or it is not. A probability statement about a fixed constant is either 0 or 1.
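A quick simulation (with a made-up true mean of 10 and assumed normal noise) shows what the 95% actually refers to: repeat the interval-building procedure many times and count how often it captures the fixed true mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mu, n, n_sims = 10.0, 25, 10_000

covered = 0
for _ in range(n_sims):
    sample = rng.normal(loc=true_mu, scale=2.0, size=n)
    lo, hi = stats.t.interval(0.95, df=n - 1,
                              loc=sample.mean(), scale=stats.sem(sample))
    covered += lo <= true_mu <= hi  # the interval is random; true_mu is fixed

print("Coverage:", covered / n_sims)  # should be close to 0.95
```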
Sources
- Mathematics for Machine Learning (link)