Central Limit Theorem (CLT)

Lily Koffman

Department of Biostatistics, Johns Hopkins School of Public Health

Learning objectives

  1. Describe how changing sample size impacts the distribution of the sample mean

  2. Explain how the Central Limit Theorem quantifies the uncertainty of a sample mean

Question: what is the average daily caffeine consumption among Americans?

How might we answer this question?

We could survey every single American

We could take a sample

Sample one person

Participant ID   Age (years)   Caffeine intake (mg)
134718           10            13

Add another person to sample and calculate average

Participant ID   Age (years)   Caffeine intake (mg)   Sample mean (cumulative, mg)
134718           10            13                     13
132129           10            121                    67

Add more people to sample

Participant ID   Age (years)   Caffeine intake (mg)   Sample mean (cumulative, mg)
134718           10            13                     13
132129           10            121                    67
132137           43            102                    79
141998           52            144                    95
133390           49            192                    114
140734           6             4                      96
136086           15            0                      82
135573           70            202                    97
135693           30            360                    126
140558           59            0                      114
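
The cumulative column can be reproduced directly from the intake values. The following is a minimal sketch using the ten caffeine values from the table; the function name `cumulative_means` is chosen here for illustration and is not part of the original slides.

```python
# Caffeine intake (mg) for the ten participants in the table, in sampling order.
caffeine_mg = [13, 121, 102, 144, 192, 4, 0, 202, 360, 0]


def cumulative_means(values):
    """Return the running sample mean after each new observation."""
    means = []
    total = 0
    for count, value in enumerate(values, start=1):
        total += value
        means.append(total / count)
    return means


running = cumulative_means(caffeine_mg)
for n, mean in enumerate(running, start=1):
    print(f"n = {n:2d}  sample mean = {mean:6.1f} mg")
```

Rounded to the nearest milligram, these running means match the table. Note how much a single large value (360 mg) moves the mean when the sample is still small.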

Lily’s sample

What if someone else went out and sampled different people?

A different sample

A third sample

How do the samples compare?

What will happen if 100 different people take samples?

What will happen if 500 different people take samples?

Seems like the means are approaching some number

But there’s still some variability

Distribution of sample means at n = 100

What does the distribution of sample means look like at n = 50?

Distribution of sample means at n = 50

Comparison of sample means for n = 50 and n = 100

Let’s compare the distribution of sample means for a range of sample sizes
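
One way to make this comparison concrete is a small simulation. The sketch below is not the survey data behind the slides: it assumes a hypothetical right-skewed population (exponential with mean 100 mg), and the names `POP_MEAN_MG`, `draw_sample_mean`, and `simulate_sample_means` are invented for illustration.

```python
import random
import statistics

random.seed(1)

# Hypothetical population mean in mg (an assumption, not taken from the survey data).
POP_MEAN_MG = 100.0


def draw_sample_mean(n):
    """Mean of n draws from a right-skewed (exponential) stand-in population."""
    draws = [random.expovariate(1 / POP_MEAN_MG) for _ in range(n)]
    return statistics.fmean(draws)


def simulate_sample_means(n, n_samples=1000):
    """Repeat the 'go out and take a sample of size n' experiment n_samples times."""
    return [draw_sample_mean(n) for _ in range(n_samples)]


spread_by_n = {}
for n in (10, 50, 100, 500):
    means = simulate_sample_means(n)
    spread_by_n[n] = statistics.stdev(means)
    print(f"n = {n:3d}  center = {statistics.fmean(means):6.1f}  spread (SD) = {spread_by_n[n]:5.1f}")
```

Under these assumptions the center of the sample means stays near the population mean for every sample size, while the spread shrinks as the sample size grows.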

Recap so far

  • We wanted to use a sample to learn about a population
  • So far, we’ve taken a lot of different samples of size \(n\)
  • We saw that the means of those samples were approximately Normally distributed, as long as \(n\) is big enough
  • In reality
    • We only get one sample and sample mean (\(\bar{X}_n\))
    • We never know the true population mean \((\mu)\)
    • How far away is our sample mean from the truth?

Central Limit Theorem

  • The Central Limit Theorem answers this question: how far is our sample mean likely to be from the truth?

  • As the sample size \(n\) increases, the distribution of the sample mean \(\bar{X}_n\) approaches a Normal distribution with mean \(\mu\) and variance \(\frac{\sigma^2}{n}\)

    • \(\mu\): population mean
    • \(\sigma^2\): population variance

\[\bar{X}_n \overset{\text{approx.}}{\sim} N\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{for large } n\]
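
The claim can be checked numerically. This is a minimal sketch under stated assumptions: the population is a hypothetical exponential distribution with \(\mu = \sigma = 100\) (not the caffeine data), and the constants `MU`, `SIGMA`, `N`, and `N_SAMPLES` are illustrative choices. If the Normal approximation holds, the standard deviation of the sample means should be close to \(\sigma/\sqrt{n}\), and about 95% of sample means should fall within 1.96 of those standard deviations from \(\mu\).

```python
import math
import random
import statistics

random.seed(2)

MU = 100.0       # hypothetical population mean in mg (assumption)
SIGMA = 100.0    # an exponential population has standard deviation equal to its mean
N = 100          # size of each sample
N_SAMPLES = 2000 # number of repeated samples

# Standard deviation of the sample mean predicted by the CLT.
standard_error = SIGMA / math.sqrt(N)

sample_means = [
    statistics.fmean([random.expovariate(1 / MU) for _ in range(N)])
    for _ in range(N_SAMPLES)
]

observed_sd = statistics.stdev(sample_means)
share_within = sum(
    abs(m - MU) <= 1.96 * standard_error for m in sample_means
) / N_SAMPLES

print(f"predicted SD of sample mean: {standard_error:.2f}")
print(f"observed SD of sample means: {observed_sd:.2f}")
print(f"share within 1.96 predicted SDs of mu: {share_within:.3f}")
```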

The Central Limit Theorem

The means of our earlier samples were draws from this Normal distribution

Central Limit Theorem

If we only observe one sample, how can we actually make a statement about the population?

Teaser for next time: inference (confidence intervals)

Using the CLT for inference: confidence intervals
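
As a preview, here is a minimal sketch of a large-sample 95% confidence interval built from a single sample. It assumes the standard interval \(\bar{X}_n \pm 1.96\, s/\sqrt{n}\), where \(s\) is the sample standard deviation; the slides themselves do not state this formula. The sample is simulated from a hypothetical exponential population with mean 100 mg, and the function name `clt_confidence_interval` is invented for illustration.

```python
import math
import random
import statistics

random.seed(4)

# One simulated sample of 100 people from a hypothetical population (assumption).
sample = [random.expovariate(1 / 100.0) for _ in range(100)]


def clt_confidence_interval(values, z=1.96):
    """Approximate 95% interval for the population mean: mean +/- z * s / sqrt(n)."""
    n = len(values)
    mean = statistics.fmean(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return mean - z * standard_error, mean + z * standard_error


sample_mean = statistics.fmean(sample)
lower, upper = clt_confidence_interval(sample)
print(f"sample mean = {sample_mean:.1f} mg, 95% CI = ({lower:.1f}, {upper:.1f}) mg")
```

The interval uses only quantities available from the one sample we actually observe, which is what makes the CLT useful in practice.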

Questions?

Appendix

Distribution of sample mean: hours worked

Distribution of the sample mean: S&P 500 Volume

Other helpful interactive demonstrations: Stats calculators, Seeing Theory

History: coin flips

De Moivre and (later) Laplace connect outcomes of coin flips and the Normal distribution

History: Quincunx

Quincunx Machine (Galton Board)

Play more with the simulation here

Formal CLT

Let \(X_1, X_2, \dots, X_n\) be independent and identically distributed (i.i.d.) random variables with: \[E[X_i] = \mu \quad \text{and} \quad 0 < Var(X_i) = \sigma^2 < \infty\]


Let \(\bar{X}_n\) be the sample mean: \(\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i\)
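
A short derivation sketch shows where the mean \(\mu\) and variance \(\frac{\sigma^2}{n}\) of the sample mean come from. The mean follows from linearity of expectation; the variance step relies on independence, so that the variance of the sum is the sum of the variances:

\[E[\bar{X}_n] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n} \cdot n\mu = \mu\]

\[Var(\bar{X}_n) = \frac{1}{n^2}\sum_{i=1}^{n} Var(X_i) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}\]

These two facts hold for any \(n\). What the CLT adds is the shape of the distribution: approximately Normal when \(n\) is large.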


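As the sample size \(n\) approaches infinity, the distribution of the standardized sample mean converges to a standard Normal distribution:

\[\frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} N(0, 1)\]

Equivalently, for large \(n\), \(\bar{X}_n\) is approximately distributed as \(N\left(\mu, \frac{\sigma^2}{n}\right)\).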

What assumptions do we need for CLT?

  • Independence: the value of one observation must not influence the value of another
    • Violation: siblings, members of the same household, both eyes of the same person
  • Identically distributed: all observations come from the same underlying population
    • Violation: pooling observations from the U.S. and Europe when the two populations have different distributions
  • Finite mean and variance: the underlying population must have a finite mean and variance
    • Violation: some theoretical distributions (e.g., Cauchy), for which the CLT does not apply; not common in real data
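
The Cauchy violation can be seen in a simulation. The sketch below compares the spread of sample means for a standard Normal population and a standard Cauchy population, using the interquartile range (IQR) because the Cauchy has no finite variance. The helper names `standard_cauchy` and `iqr_of_sample_means` are invented for illustration.

```python
import math
import random
import statistics

random.seed(3)


def standard_cauchy():
    """Draw from a standard Cauchy via the inverse-CDF method: tan(pi * (U - 1/2))."""
    return math.tan(math.pi * (random.random() - 0.5))


def standard_normal():
    """Draw from a standard Normal distribution."""
    return random.gauss(0, 1)


def iqr_of_sample_means(draw, n, n_samples=500):
    """IQR of n_samples sample means, each computed from n draws."""
    means = [statistics.fmean([draw() for _ in range(n)]) for _ in range(n_samples)]
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1


iqr = {}
for n in (10, 400):
    iqr[("normal", n)] = iqr_of_sample_means(standard_normal, n)
    iqr[("cauchy", n)] = iqr_of_sample_means(standard_cauchy, n)
    print(f"n = {n:3d}  Normal IQR = {iqr[('normal', n)]:.3f}  Cauchy IQR = {iqr[('cauchy', n)]:.3f}")
```

For the Normal population the spread of the sample mean shrinks as \(n\) grows. For the Cauchy population it does not shrink, because the mean of \(n\) standard Cauchy draws is itself standard Cauchy.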