All introductory statistics classes discuss the central limit theorem (CLT). It has a strong case for being the most important result in statistics (and maybe in all of social science). It’s because of the CLT that we are able to make claims about a population based on a random sample of that population. As such, you can thank it for every political poll, clinical trial, and A/B test (for you marketers). The way it’s usually presented is as follows:
Central Limit Theorem: Let X be a random variable with finite mean μ and standard deviation σ. For a sufficiently large sample size N, the distribution of sample means

X̄ = (X_1 + X_2 + ... + X_N)/N

is approximately normal with mean μ and standard deviation σ/√N.
In English, this means that if you repeatedly took random samples of size N and made a histogram out of the averages of those samples, the histogram would be a bell-shaped curve. The crucial part of the statement is that the sample means form this bell-shaped curve no matter what the population looks like. It could be normal itself, or highly skewed, or twin-peaked, etc. The only thing that matters is that your samples are large enough.
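If you want to see this for yourself, here is a minimal simulation sketch (it assumes NumPy and Matplotlib, and uses an exponential population purely as an example of a skewed distribution): repeatedly draw samples of size N, average each one, and histogram the averages.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)

# A deliberately non-normal, right-skewed population: exponential with mean 2.
population_mean = 2.0
N = 50                # sample size
num_samples = 10_000  # how many samples (of size N) we draw

# Each row is one sample of size N; average across each row.
samples = rng.exponential(scale=population_mean, size=(num_samples, N))
sample_means = samples.mean(axis=1)

# The histogram of the sample means comes out roughly bell-shaped and
# centered near the population mean, even though the population is skewed.
plt.hist(sample_means, bins=50)
plt.axvline(population_mean, color="red", linestyle="--")
plt.title("Sample means of N = 50 draws from an exponential population")
plt.xlabel("sample mean")
plt.show()
```

Try swapping in any other population you like; as long as N is reasonably large, the histogram of the sample means keeps its bell shape.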
To prove the CLT, you need some pretty serious probability weapons. For that reason, almost every teacher just states it and moves on (including yours truly). However, you’ll notice that there are a few assertions in the statement of the theorem:
Sample means are approximately normally distributed for large enough N.
The expected value for the distribution of sample means is μ.
The standard deviation for the distribution of sample means is σ/√N.
Of these, the first one is difficult to prove. The latter two, on the other hand, are accessible in an intro stats course. More importantly, the statements about the mean and standard deviation do not require “large enough” samples. They are true for any value of N, including 1!
Expectation and variance of sample means
Suppose that X is any population with mean μ and standard deviation σ. Draw a random sample of N members from the population and compute the sample mean as above. Because this is a random sample of the population, each X_i has the same mean and standard deviation as X (μ and σ, respectively). Also, the value of each member of the sample is independent of the values of all the other members.
I’ll start by calculating the expected value of the sample mean. If you expect X to be μ, then what would you expect the average of a random sample from X to be? There’s no reason for it to be anything other than μ. Let’s prove that this is the case using the fact that expected value is linear (i.e., E[aX + bY] = a*E[X] + b*E[Y]).
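One way to write the argument out, using the sample mean X̄ defined above:

$$
E[\bar{X}] \;=\; E\!\left[\frac{X_1 + X_2 + \cdots + X_N}{N}\right]
\;=\; \frac{1}{N}\bigl(E[X_1] + E[X_2] + \cdots + E[X_N]\bigr)
\;=\; \frac{1}{N}(N\mu)
\;=\; \mu.
$$

Each E[X_i] equals μ because every member of a random sample has the same distribution as the population.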
To recap: if you take a sample from a population, then the mean of that sample is expected to be the same as the mean of the entire population. Of course, because randomness is involved, your sample mean likely won’t be exactly equal to the population mean. This is known as sampling error.
The fact that the sample mean might deviate from the population mean is why we care about the standard deviation too. Standard deviation tells us how much volatility there is in the data. In other words, how spread out is the data around the mean? In our problem, we know that the standard deviation for the population is σ. What we want is the standard deviation of the distribution of sample means. Actually, it’s easier to manipulate variance instead, which is standard deviation squared. Variance is defined as the expected value of squared errors from the mean.
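In symbols, with μ = E[X]:

$$
\mathrm{Var}[X] \;=\; E\!\left[(X - \mu)^2\right]
$$

$$
\mathrm{Var}[X] \;=\; E[X^2] - \bigl(E[X]\bigr)^2
$$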
The second equation is a handy way of calculating variance. It falls out from the definition if you FOIL the squared term and use the linearity of expected value. While variance is not linear, it does have some nice properties:
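For a constant a and random variables X and Y:

$$
\mathrm{Var}[aX] \;=\; a^2\,\mathrm{Var}[X]
$$

$$
\mathrm{Var}[X + Y] \;=\; \mathrm{Var}[X] + \mathrm{Var}[Y]
$$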
The second equation holds only if X and Y are independent. Fortunately, we make that assumption when we take a random sample from the population. Why a² in the first equation? The reason is that variance is computed as the average of squared errors from the mean. If you take the standard deviation instead, by taking the square root of the variance, you’ll be left with just |a|.
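Spelling that out with the definition of variance (the mean of aX is aμ, so every error from the mean picks up a factor of a, and every squared error picks up a²):

$$
\mathrm{Var}[aX] \;=\; E\!\left[(aX - a\mu)^2\right] \;=\; E\!\left[a^2 (X - \mu)^2\right] \;=\; a^2\, E\!\left[(X - \mu)^2\right] \;=\; a^2\,\mathrm{Var}[X].
$$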
Now let’s compute the variance for the distribution of sample means.
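Using both properties above (independence is what lets the variance of the sum split into the sum of the variances):

$$
\mathrm{Var}[\bar{X}]
\;=\; \mathrm{Var}\!\left[\frac{X_1 + X_2 + \cdots + X_N}{N}\right]
\;=\; \frac{1}{N^2}\bigl(\mathrm{Var}[X_1] + \mathrm{Var}[X_2] + \cdots + \mathrm{Var}[X_N]\bigr)
\;=\; \frac{1}{N^2}(N\sigma^2)
\;=\; \frac{\sigma^2}{N}.
$$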
To get the standard deviation, we take the square root:
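Writing σ_X̄ for the standard deviation of the distribution of sample means:

$$
\sigma_{\bar{X}} \;=\; \sqrt{\frac{\sigma^2}{N}} \;=\; \frac{\sigma}{\sqrt{N}},
$$

which is exactly the σ/√N that appears in the statement of the theorem.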
As you can see, computing the mean and standard deviation/variance for sample means is pretty easy. (The argument is short, at least.) You also don’t need to worry about your sample size: these parameters hold exactly for samples of any size.¹ Next time you hear about the central limit theorem, remember that the real “meat” of the result is that the sample means fall into a nice bell curve if your samples are sufficiently large.
¹ One caveat: you typically use the sample standard deviation s to approximate the population standard deviation σ. This approximation does improve as sample sizes get larger.