Central Limit Theorem (CLT)

Lily Koffman

Department of Biostatistics, Johns Hopkins School of Public Health

Learning objectives

Describe how changing sample size impacts the distribution of the sample mean
Explain how the Central Limit Theorem quantifies the uncertainty of a sample mean

Question: what is the average daily caffeine consumption among Americans?

How might we answer this question?

We could survey every single American

We could take a sample

Sample one person

Participant ID	Age (years)	Caffeine intake (mg)
134718	10	13

Add another person to sample and calculate average

Participant ID	Age (years)	Caffeine intake (mg)	Sample mean (cumulative)
134718	10	13	13
132129	10	121	67

Add more people to sample

Participant ID	Age (years)	Caffeine intake (mg)	Sample mean (cumulative)
134718	10	13	13
132129	10	121	67
132137	43	102	79
141998	52	144	95
133390	49	192	114
140734	6	4	96
136086	15	0	82
135573	70	202	97
135693	30	360	126
140558	59	0	114
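The cumulative column in the table above can be reproduced with a few lines of plain JavaScript. This is a minimal sketch: the helper name `cumulativeMeans` is made up for illustration, and the values are typed in by hand from the table rather than read from the data file.

```javascript
// Caffeine intake (mg) for the ten sampled participants, in table order.
const caffeineMg = [13, 121, 102, 144, 192, 4, 0, 202, 360, 0];

// Return the mean of the first k values for k = 1, ..., values.length.
// (Hypothetical helper, not part of the slide code.)
function cumulativeMeans(values) {
  let runningTotal = 0;
  return values.map((value, i) => {
    runningTotal += value;
    return runningTotal / (i + 1);
  });
}

// Round to whole milligrams, as in the table.
const roundedMeans = cumulativeMeans(caffeineMg).map(m => Math.round(m));
console.log(roundedMeans);
// → [13, 67, 79, 95, 114, 96, 82, 97, 126, 114]
```

Each new participant can move the running mean quite a bit when the sample is this small.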

Lily’s sample

What if someone else went out and sampled different people?

A different sample

A third sample

How do the samples compare?

What will happen if 100 different people take samples?

What will happen if 500 different people take samples?

Seems like the means are approaching some number

But there’s still some variability

Distribution of sample means at n = 100

What does the distribution of sample means look like at n = 50?

Distribution of sample means at n = 50

Comparison of sample means for n = 50 and n = 100

Let’s compare the distribution of sample means for a range of sample sizes

data = FileAttachment("caffeine_df.csv").csv({ typed: true })

<!-- viewof sample_size = Inputs.range( -->
<!--   [2, 120],  -->
<!--   {value: 2, step: 1, label: "Sample size:"} -->
<!-- ) -->

sample_values = [2, 10, 25, 50, 75, 100, 200, 500]

viewof sample_index = Inputs.range(
  [0, sample_values.length - 1], // Slide from index 0 to 7
  {
    value: 0, // Start at index 0, i.e. sample size 2
    step: 1, 
    label: "Sample size:",
    // This makes the label show "50" instead of "3"
    format: i => sample_values[i] 
  }
)

sample_size = sample_values[sample_index]


sample_means = {
  // Dependency tracking: re-run when these change
  sample_size;
  
  // Safety check: ensure data is loaded
  if (!data || data.length === 0) return [];
  
  // Safety check: ensure column exists (Case Sensitive!)
  if (typeof data[0].caffeine_mg === "undefined") {
    return ["Error: Column 'caffeine_mg' not found in data"];
  }

  const means = [];
  // Keep an unshuffled copy of the population. d3.shuffle works in place,
  // so each iteration below shuffles a fresh copy and every sample is
  // drawn from the full population.
  const popArray = Array.from(data); 

  for (let i = 0; i < 500; i++) {
    // 1. Shuffle a clean copy of the population
    // 2. Slice the first n elements (Sampling WITHOUT replacement)
    const sample = d3.shuffle([...popArray]).slice(0, sample_size);

    // 3. Compute mean
    const m = d3.mean(sample, d => d.caffeine_mg);
    means.push(m);
  }

  return means;
}


// 5. CALCULATE NORMAL CURVE (THEORETICAL)
// We now generate the curve based on the Population Parameters, not the sample.
// This shows the Theoretical prediction of the CLT.
normalCurve = {
  if (sample_means.length < 2 || !data) return [];
  
  // 1. Get Population Statistics
  const popMean = d3.mean(data, d => d.caffeine_mg);
  const popSD = d3.deviation(data, d => d.caffeine_mg);
  
  // 2. Calculate Standard Error (The predicted width of the sampling distribution)
  // Formula: Population SD / sqrt(n)
  const se = popSD / Math.sqrt(sample_size);

  // 3. Determine range for the curve (centered on popMean)
  const minX = popMean - (4 * se);
  const maxX = popMean + (4 * se);
  
  // Create a smooth grid of X values
  const grid = d3.range(minX, maxX, (maxX - minX) / 200);
  
  // Normal Density Function using Population Mean and Standard Error
  const pdf = (x) => (1 / (se * Math.sqrt(2 * Math.PI))) * Math.exp(-0.5 * Math.pow((x - popMean) / se, 2));
  
  return grid.map(x => ({x: x, y: pdf(x)}));
}

// 6. SIDE-BY-SIDE PLOTS WITH EXTERNAL LEGEND
{
  // --- A. SETUP VARIABLES ---
  const popMean = d3.mean(data, d => d.caffeine_mg);
  const n = sample_means.length;
  const bins = d3.bin().thresholds(20)(sample_means);
  
  // NEW: Calculate fixed x-axis limits based on the Population Data
  // This ensures the axis stays constant regardless of sample size
  const xLimits = d3.extent(data, d => d.caffeine_mg);
  const xLimits2 = [0, 750];

  // Define colors
  const colorDomain = ["Sampling Distribution Mean", "Population Mean", "Normal Curve"];
  const colorRange = ["black", "#E69F00", "#D55E00"];

  // --- B. POPULATION PLOT (Left) ---
  const popPlot = Plot.plot({
    width: 550,
    height: 500,
    caption: "Population Distribution (Raw Data)",
    marginTop: 40,
    marginLeft: 50,
    marginBottom: 40,
    style: { fontSize: "14px", overflow: "visible" },
    
    marks: [
      Plot.rectY(data, Plot.binX({y: "count"}, {x: "caffeine_mg", fill: "#444444", fillOpacity: 0.5})),
    ],
    // We explicitly set the domain here too, just to be safe and consistent
    x: { label: "Caffeine consumption (mg)", domain: xLimits }, 
    y: { label: "Count" }
  });

  // --- C. SAMPLING DISTRIBUTION PLOT (Right) ---
  const sampPlot = Plot.plot({
    width: 550,
    height: 500,
    caption: `Distribution of sample means (n=${sample_size})`,
    marginTop: 40,
    marginLeft: 50,
    marginBottom: 40,
    style: { fontSize: "14px", overflow: "visible" },
    
    color: {
      domain: colorDomain,
      range: colorRange,
      legend: false
    },
    marks: [
      Plot.rectY(bins, {
        x1: "x0", x2: "x1", 
        y: d => d.length / (n * (d.x1 - d.x0)), 
        fill: "#444444", fillOpacity: 0.3, tip: true,
        title: d => `Count: ${d.length}\nDensity: ${(d.length / (n * (d.x1 - d.x0))).toFixed(3)}`
      }),
      Plot.line(normalCurve, {x: "x", y: "y", stroke: () => "Normal Curve", strokeWidth: 3}),
      Plot.ruleX([d3.mean(sample_means)], {stroke: () => "Sampling Distribution Mean", strokeWidth: 3}),
      Plot.ruleX([popMean], {stroke: () => "Population Mean", strokeDasharray: "4,4", strokeWidth: 2, opacity: 0.8})
    ],
    // Use the hand-picked fixed range (xLimits2) so the axis stays constant across sample sizes.
    // Alternative: the full population range.
    // x: { label: "Sample Mean", tickFormat: ".1f", domain: xLimits }, 
    x: { label: "Sample Mean", tickFormat: ".1f", domain: xLimits2 },
    y: { label: "Density" }
  });

  // --- D. LEGEND & LAYOUT ---
  const myLegend = Plot.legend({
    color: { domain: colorDomain, range: colorRange },
    style: { marginBottom: "10px", fontSize: "14px" }
  });

  return html`
  <div style="display: flex; flex-direction: column; align-items: center;">
    <div style="width: 100%; max-width: 920px; display: flex; justify-content: flex-end;">
      ${myLegend}
    </div>
    <div style="display: flex; flex-wrap: wrap; gap: 20px; justify-content: center;">
      <div>${popPlot}</div>
      <div>${sampPlot}</div>
    </div>
  </div>`;
}

Recap so far

We wanted to use a sample to learn about a population
So far, we’ve taken a lot of different samples of size \(n\)
We saw that the means of those samples were approximately Normally distributed, as long as \(n\) is big enough
In reality
- We only get one sample and sample mean (\(\bar{X}_n\))
- We never know the true population mean \((\mu)\)
- How far away is our sample mean from the truth?

Central Limit Theorem

Central Limit Theorem: answers this question
As the sample size \(n\) increases, the distribution of the sample mean \(\bar{X}_n\) approaches a Normal distribution with mean \(\mu\) and variance \(\frac{\sigma^2}{n}\)
- \(\mu\): population mean
- \(\sigma^2\): population variance

\[\bar{X}_n \xrightarrow{d} N\left(\mu, \frac{\sigma^2}{n}\right)\]
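The variance \(\frac{\sigma^2}{n}\) in the statement above means the standard deviation of the sample mean (its standard error) is \(\sigma / \sqrt{n}\). The sketch below shows how that shrinks with \(n\). The value \(\sigma = 100\) mg is an assumption chosen to keep the arithmetic clean, not the caffeine population's actual standard deviation, and `standardError` is a hypothetical helper name.

```javascript
// Standard deviation of the sample mean predicted by the CLT: sigma / sqrt(n).
function standardError(sigma, n) {
  if (!(sigma > 0) || !(n >= 1)) {
    throw new RangeError("sigma must be positive and n must be at least 1");
  }
  return sigma / Math.sqrt(n);
}

// ASSUMPTION: sigma = 100 mg is a made-up value for illustration only.
const sigma = 100;
for (const n of [25, 100, 400]) {
  console.log(`n = ${n}: SE = ${standardError(sigma, n)}`);
}
// n = 25: SE = 20
// n = 100: SE = 10
// n = 400: SE = 5
```

Quadrupling the sample size only halves the standard error, which is why the histograms of sample means narrow more and more slowly as \(n\) grows.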

The Central Limit Theorem

The means of our earlier samples were draws from this Normal distribution

Central Limit Theorem

If we only observe one sample, how can we actually make a statement about the population?

Teaser for next time: inference (confidence intervals)

Using the CLT for inference: confidence intervals

Questions?

Appendix

Distribution of sample mean: hours worked

data2 = FileAttachment("work_df.csv").csv({ typed: true })
sample_values2 = [2, 10, 25, 50, 75, 100, 200, 500]

viewof sample_index2 = Inputs.range(
  [0, sample_values2.length - 1], // Slide from index 0 to 7
  {
    value: 0, // Start at index 0, i.e. sample size 2
    step: 1, 
    label: "Sample size:",
    // This makes the label show "50" instead of "3"
    format: i => sample_values2[i] 
  }
)

sample_size2 = sample_values2[sample_index2]

sample_means2 = {
  // Dependency tracking: re-run when these change
  sample_size2;
  
  // Safety check: ensure data is loaded
  if (!data2 || data2.length === 0) return [];
  
  // Safety check: ensure column exists (Case Sensitive!)
  if (typeof data2[0].hrs_worked === "undefined") {
    return ["Error: Column 'hrs_worked' not found in data"];
  }

  const means = [];
  // Keep an unshuffled copy of the population. d3.shuffle works in place,
  // so each iteration below shuffles a fresh copy and every sample is
  // drawn from the full population.
  const popArray = Array.from(data2); 

  for (let i = 0; i < 500; i++) {
    // 1. Shuffle a clean copy of the population
    // 2. Slice the first n elements (Sampling WITHOUT replacement)
    const sample = d3.shuffle([...popArray]).slice(0, sample_size2);

    // 3. Compute mean
    const m = d3.mean(sample, d => d.hrs_worked);
    means.push(m);
  }

  return means;
}


// 5. CALCULATE NORMAL CURVE (THEORETICAL)
// We now generate the curve based on the Population Parameters, not the sample.
// This shows the Theoretical prediction of the CLT.
normalCurve2 = {
  if (sample_means2.length < 2 || !data2) return [];
  
  // 1. Get Population Statistics
  const popMean = d3.mean(data2, d => d.hrs_worked);
  const popSD = d3.deviation(data2, d => d.hrs_worked);
  
  // 2. Calculate Standard Error (The predicted width of the sampling distribution)
  // Formula: Population SD / sqrt(n)
  const se = popSD / Math.sqrt(sample_size2);

  // 3. Determine range for the curve (centered on popMean)
  const minX = popMean - (4 * se);
  const maxX = popMean + (4 * se);
  
  // Create a smooth grid of X values
  const grid = d3.range(minX, maxX, (maxX - minX) / 200);
  
  // Normal Density Function using Population Mean and Standard Error
  const pdf = (x) => (1 / (se * Math.sqrt(2 * Math.PI))) * Math.exp(-0.5 * Math.pow((x - popMean) / se, 2));
  
  return grid.map(x => ({x: x, y: pdf(x)}));
}

// 6. SIDE-BY-SIDE PLOTS WITH EXTERNAL LEGEND
{
  // --- A. SETUP VARIABLES ---
  const popMean = d3.mean(data2, d => d.hrs_worked);
  const n = sample_means2.length;
  const bins = d3.bin().thresholds(20)(sample_means2);
  
  // NEW: Calculate fixed x-axis limits based on the Population Data
  // This ensures the axis stays constant regardless of sample size
  const xLimits = d3.extent(data2, d => d.hrs_worked);

  // Define colors
  const colorDomain = ["Sampling Distribution Mean", "Population Mean", "Normal Curve"];
  const colorRange = ["black", "#E69F00", "#D55E00"];

  // --- B. POPULATION PLOT (Left) ---
  const popPlot = Plot.plot({
    width: 550,
    height: 500,
    caption: "Population Distribution (Raw Data)",
    marginTop: 40,
    marginLeft: 50,
    marginBottom: 40,
    style: { fontSize: "14px", overflow: "visible" },
    
    marks: [
      Plot.rectY(data2, Plot.binX({y: "count"}, {x: "hrs_worked", fill: "#444444", fillOpacity: 0.5})),
    ],
    // We explicitly set the domain here too, just to be safe and consistent
    x: { label: "Hours worked", domain: xLimits }, 
    y: { label: "Count" }
  });

  // --- C. SAMPLING DISTRIBUTION PLOT (Right) ---
  const sampPlot = Plot.plot({
    width: 550,
    height: 500,
    caption: `Sampling Distribution (n=${sample_size2})`,
    marginTop: 40,
    marginLeft: 50,
    marginBottom: 40,
    style: { fontSize: "14px", overflow: "visible" },
    
    color: {
      domain: colorDomain,
      range: colorRange,
      legend: false
    },
    marks: [
      Plot.rectY(bins, {
        x1: "x0", x2: "x1", 
        y: d => d.length / (n * (d.x1 - d.x0)), 
        fill: "#444444", fillOpacity: 0.3, tip: true,
        title: d => `Count: ${d.length}\nDensity: ${(d.length / (n * (d.x1 - d.x0))).toFixed(3)}`
      }),
      Plot.line(normalCurve2, {x: "x", y: "y", stroke: () => "Normal Curve", strokeWidth: 3}),
      Plot.ruleX([d3.mean(sample_means2)], {stroke: () => "Sampling Distribution Mean", strokeWidth: 3}),
      Plot.ruleX([popMean], {stroke: () => "Population Mean", strokeDasharray: "4,4", strokeWidth: 2, opacity: 0.8})
    ],
    // NEW: Apply the fixed population domain here
    x: { label: "Sample Mean", tickFormat: ".1f", domain: xLimits }, 
    y: { label: "Density" }
  });

  // --- D. LEGEND & LAYOUT ---
  const myLegend = Plot.legend({
    color: { domain: colorDomain, range: colorRange },
    style: { marginBottom: "10px", fontSize: "14px" }
  });

  return html`
  <div style="display: flex; flex-direction: column; align-items: center;">
    <div style="width: 100%; max-width: 920px; display: flex; justify-content: flex-end;">
      ${myLegend}
    </div>
    <div style="display: flex; flex-wrap: wrap; gap: 20px; justify-content: center;">
      <div>${popPlot}</div>
      <div>${sampPlot}</div>
    </div>
  </div>`;
}

Distribution of the sample mean: S&P 500 Volume

stock = FileAttachment("stocks.csv").csv({ typed: true })

<!-- viewof sample_size_st = Inputs.range( -->
<!--   [2, 100],  -->
<!--   {value: 2, step: 1, label: "Sample size:"} -->
<!-- ) -->

sample_values3 = [2, 10, 25, 50, 75, 100, 200, 500]

viewof sample_index3 = Inputs.range(
  [0, sample_values3.length - 1], // Slide from index 0 to 7
  {
    value: 0, // Start at index 0, i.e. sample size 2
    step: 1, 
    label: "Sample size:",
    // This makes the label show "50" instead of "3"
    format: i => sample_values3[i] 
  }
)

sample_size_st = sample_values3[sample_index3]


sample_means_stock = {
  // Dependency tracking: re-run when these change
  sample_size_st;
  
  // Safety check: ensure stock is loaded
  if (!stock || stock.length === 0) return [];
  
  // Safety check: ensure column exists (Case Sensitive!)
  if (typeof stock[0].sp500_volume === "undefined") {
    return ["Error: Column 'sp500_volume' not found in stock"];
  }

  const means = [];
  // Keep an unshuffled copy of the population. d3.shuffle works in place,
  // so each iteration below shuffles a fresh copy and every sample is
  // drawn from the full population.
  const popArray = Array.from(stock); 

  for (let i = 0; i < 500; i++) {
    // 1. Shuffle a clean copy of the population
    // 2. Slice the first n elements (Sampling WITHOUT replacement)
    const sample = d3.shuffle([...popArray]).slice(0, sample_size_st);

    // 3. Compute mean
    const m = d3.mean(sample, d => d.sp500_volume);
    means.push(m);
  }

  return means;
}


// 5. CALCULATE NORMAL CURVE (THEORETICAL)
// We now generate the curve based on the Population Parameters, not the sample.
// This shows the Theoretical prediction of the CLT.
normalCurve_stock = {
  if (sample_means_stock.length < 2 || !stock) return [];
  
  // 1. Get Population Statistics
  const popMean = d3.mean(stock, d => d.sp500_volume);
  const popSD = d3.deviation(stock, d => d.sp500_volume);
  
  // 2. Calculate Standard Error (The predicted width of the sampling distribution)
  // Formula: Population SD / sqrt(n)
  const se = popSD / Math.sqrt(sample_size_st);

  // 3. Determine range for the curve (centered on popMean)
  const minX = popMean - (4 * se);
  const maxX = popMean + (4 * se);
  
  // Create a smooth grid of X values
  const grid = d3.range(minX, maxX, (maxX - minX) / 200);
  
  // Normal Density Function using Population Mean and Standard Error
  const pdf = (x) => (1 / (se * Math.sqrt(2 * Math.PI))) * Math.exp(-0.5 * Math.pow((x - popMean) / se, 2));
  
  return grid.map(x => ({x: x, y: pdf(x)}));
}

// 6. SIDE-BY-SIDE PLOTS WITH EXTERNAL LEGEND
{
  // --- A. SETUP VARIABLES ---
  const popMean = d3.mean(stock, d => d.sp500_volume);
  const n = sample_means_stock.length;
  const bins = d3.bin().thresholds(20)(sample_means_stock);
  
  // NEW: Calculate fixed x-axis limits based on the population data
  // This ensures the axis stays constant regardless of sample size
  const xLimits = d3.extent(stock, d => d.sp500_volume);

  // Define colors
  const colorDomain = ["Sampling Distribution Mean", "Population Mean", "Normal Curve"];
  const colorRange = ["black", "#E69F00", "#D55E00"];

  // --- B. POPULATION PLOT (Left) ---
  const popPlot = Plot.plot({
    width: 550,
    height: 500,
    caption: "Population Distribution (Raw Data)",
    marginTop: 40,
    marginLeft: 50,
    marginBottom: 40,
    style: { fontSize: "14px", overflow: "visible" },
    
    marks: [
      Plot.rectY(stock, Plot.binX({y: "count"}, {x: "sp500_volume", fill: "#444444", fillOpacity: 0.5})),
    ],
    // We explicitly set the domain here too, just to be safe and consistent
    x: { label: "S&P 500 Volume (x10^6)", domain: xLimits }, 
    y: { label: "Count" }
  });

  // --- C. SAMPLING DISTRIBUTION PLOT (Right) ---
  const sampPlot = Plot.plot({
    width: 550,
    height: 500,
    caption: `Sampling Distribution (n=${sample_size_st})`,
    marginTop: 40,
    marginLeft: 50,
    marginBottom: 40,
    style: { fontSize: "14px", overflow: "visible" },
    
    color: {
      domain: colorDomain,
      range: colorRange,
      legend: false
    },
    marks: [
      Plot.rectY(bins, {
        x1: "x0", x2: "x1", 
        y: d => d.length / (n * (d.x1 - d.x0)), 
        fill: "#444444", fillOpacity: 0.3, tip: true,
        title: d => `Count: ${d.length}\nDensity: ${(d.length / (n * (d.x1 - d.x0))).toFixed(3)}`
      }),
      Plot.line(normalCurve_stock, {x: "x", y: "y", stroke: () => "Normal Curve", strokeWidth: 3}),
      Plot.ruleX([d3.mean(sample_means_stock)], {stroke: () => "Sampling Distribution Mean", strokeWidth: 3}),
      Plot.ruleX([popMean], {stroke: () => "Population Mean", strokeDasharray: "4,4", strokeWidth: 2, opacity: 0.8})
    ],
    // NEW: Apply the fixed population domain here
    x: { label: "Sample Mean", tickFormat: ".1f", domain: xLimits }, 
    y: { label: "Density" }
  });

  // --- D. LEGEND & LAYOUT ---
  const myLegend = Plot.legend({
    color: { domain: colorDomain, range: colorRange },
    style: { marginBottom: "10px", fontSize: "14px" }
  });

  return html`
  <div style="display: flex; flex-direction: column; align-items: center;">
    <div style="width: 100%; max-width: 920px; display: flex; justify-content: flex-end;">
      ${myLegend}
    </div>
    <div style="display: flex; flex-wrap: wrap; gap: 20px; justify-content: center;">
      <div>${popPlot}</div>
      <div>${sampPlot}</div>
    </div>
  </div>`;
}

Other helpful interactive demonstrations: Stats calculators, Seeing Theory

History: coin flips

De Moivre and (later) Laplace connect outcomes of coin flips and the Normal distribution

History: Quincunx

Quincunx Machine (Galton Board)

Play more with the simulation here

Formal CLT

Let \(X_1, X_2, \dots, X_n\) be a sequence of independent and identically distributed (i.i.d.) random variables with: \[E[X_i] = \mu \quad \text{and} \quad 0 < \operatorname{Var}(X_i) = \sigma^2 < \infty\]

Let \(\bar{X}_n\) be the sample mean: \(\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i\)

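As the sample size \(n\) approaches infinity, the centered and scaled sample mean converges in distribution to a Normal distribution:

\[\sqrt{n}\left(\bar{X}_n - \mu\right) \xrightarrow{d} N\left(0, \sigma^2\right)\]

Equivalently, for large \(n\), \(\bar{X}_n\) is approximately distributed as \(N\left(\mu, \frac{\sigma^2}{n}\right)\).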
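
The mean \(\mu\) and variance \(\frac{\sigma^2}{n}\) of the sample mean follow from the i.i.d. assumptions alone, before any appeal to Normality. A short derivation, using linearity of expectation for the first line and independence for the second:

\[
\begin{aligned}
E[\bar{X}_n] &= \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n}\cdot n\mu = \mu \\
\operatorname{Var}(\bar{X}_n) &= \frac{1}{n^2}\sum_{i=1}^{n} \operatorname{Var}(X_i) = \frac{1}{n^2}\cdot n\sigma^2 = \frac{\sigma^2}{n}
\end{aligned}
\]

What the CLT adds is the shape: for large \(n\), the distribution with this mean and variance is approximately Normal.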

What assumptions do we need for CLT?

Independence: the value of one observation must not influence the value of another
- Violation: siblings, household members, eyes
Identically distributed: all observations come from the same underlying population
- Violation: sample from U.S. and Europe
Finite mean and variance: the underlying population must have a defined mean and variance
- Violation: some theoretical distributions (e.g., Cauchy) where the CLT never applies; not common in real data
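
The Cauchy violation can be seen in a small simulation. The sketch below rests on one known fact: the mean of \(n\) independent standard Cauchy draws is itself standard Cauchy, so its interquartile range (IQR) should stay near 2 no matter how large \(n\) gets. Everything else is an illustrative choice rather than part of the slide code: the helper names, the seed, the number of replicates, and the small seeded generator (mulberry32, used so results are reproducible).

```javascript
// Small seeded generator (mulberry32) so the simulation is reproducible.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Draw from a standard Cauchy distribution by inverting its CDF.
function drawCauchy(rand) {
  return Math.tan(Math.PI * (rand() - 0.5));
}

// IQR of `reps` sample means, each computed from n Cauchy draws.
function cauchySampleMeanIQR(n, reps, seed) {
  const rand = mulberry32(seed);
  const means = [];
  for (let r = 0; r < reps; r++) {
    let total = 0;
    for (let i = 0; i < n; i++) total += drawCauchy(rand);
    means.push(total / n);
  }
  means.sort((a, b) => a - b);
  const q1 = means[Math.floor(0.25 * reps)];
  const q3 = means[Math.floor(0.75 * reps)];
  return q3 - q1;
}

// Compare a small and a large sample size (2000 replicates each, seed 1).
for (const n of [10, 500]) {
  const iqr = cauchySampleMeanIQR(n, 2000, 1);
  console.log(`n = ${n}: IQR of sample means = ${iqr.toFixed(2)}`);
}
```

For a population with finite variance, the spread of the sample means would shrink by roughly a factor of \(\sqrt{50} \approx 7\) between these two sample sizes. Here it should not shrink at all.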