| Participant ID | Age (years) | Caffeine intake (mg) |
|---|---|---|
| 134718 | 10 | 13 |
Department of Biostatistics, Johns Hopkins School of Public Health
Describe how changing sample size impacts the distribution of the sample mean
Explain how the Central Limit Theorem quantifies the uncertainty of a sample mean
| Participant ID | Age (years) | Caffeine intake (mg) | Sample mean (cumulative) |
|---|---|---|---|
| 134718 | 10 | 13 | 13 |
| 132129 | 10 | 121 | 67 |
| 132137 | 43 | 102 | 79 |
| 141998 | 52 | 144 | 95 |
| 133390 | 49 | 192 | 114 |
| 140734 | 6 | 4 | 96 |
| 136086 | 15 | 0 | 82 |
| 135573 | 70 | 202 | 97 |
| 135693 | 30 | 360 | 126 |
| 140558 | 59 | 0 | 114 |
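The cumulative column can be reproduced with a running sum. A minimal sketch in plain JavaScript (not an Observable cell), using the ten caffeine intake values from the table; the function name `cumulativeMeans` is illustrative and not part of the original notebook:

```javascript
// Caffeine intake (mg) for the ten participants, in table order.
const intake = [13, 121, 102, 144, 192, 4, 0, 202, 360, 0];

// Return the mean of the first k values for k = 1..values.length.
function cumulativeMeans(values) {
  const means = [];
  let runningSum = 0;
  values.forEach((value, i) => {
    runningSum += value;
    means.push(runningSum / (i + 1));
  });
  return means;
}

// Round to whole milligrams, as in the table.
const rounded = cumulativeMeans(intake).map(m => Math.round(m));
console.log(rounded.join(", "));
// prints "13, 67, 79, 95, 114, 96, 82, 97, 126, 114"
```

With only ten participants the running mean still moves by more than 10 mg at every step, so a sample mean from a small sample is far from settled.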
data = FileAttachment("caffeine_df.csv").csv({ typed: true })
sample_values = [2, 10, 25, 50, 75, 100, 200, 500]
viewof sample_index = Inputs.range(
[0, sample_values.length - 1], // Slide from index 0 to 7
{
value: 0, // Start at index 0 (sample size 2)
step: 1,
label: "Sample size:",
// This makes the label show "50" instead of "3"
format: i => sample_values[i]
}
)
sample_size = sample_values[sample_index]
sample_means = {
// Dependency tracking: re-run when these change
sample_size;
// Safety check: ensure data is loaded
if (!data || data.length === 0) return [];
// Safety check: ensure column exists (Case Sensitive!)
if (typeof data[0].caffeine_mg === "undefined") {
return ["Error: Column 'caffeine_mg' not found in data"];
}
const means = [];
// Copy the population once. d3.shuffle reorders its argument in place,
// so each iteration below shuffles a fresh copy and every sample is
// drawn from the full population.
const popArray = Array.from(data);
for (let i = 0; i < 500; i++) {
// 1. Shuffle a clean copy of the population
// 2. Slice the first n elements (Sampling WITHOUT replacement)
const sample = d3.shuffle([...popArray]).slice(0, sample_size);
// 3. Compute mean
const m = d3.mean(sample, d => d.caffeine_mg);
means.push(m);
}
return means;
}
// 5. CALCULATE NORMAL CURVE (THEORETICAL)
// We now generate the curve based on the Population Parameters, not the sample.
// This shows the Theoretical prediction of the CLT.
normalCurve = {
if (sample_means.length < 2 || !data) return [];
// 1. Get Population Statistics
const popMean = d3.mean(data, d => d.caffeine_mg);
const popSD = d3.deviation(data, d => d.caffeine_mg);
// 2. Calculate Standard Error (The predicted width of the sampling distribution)
// Formula: Population SD / sqrt(n)
const se = popSD / Math.sqrt(sample_size);
// 3. Determine range for the curve (centered on popMean)
const minX = popMean - (4 * se);
const maxX = popMean + (4 * se);
// Create a smooth grid of X values
const grid = d3.range(minX, maxX, (maxX - minX) / 200);
// Normal Density Function using Population Mean and Standard Error
const pdf = (x) => (1 / (se * Math.sqrt(2 * Math.PI))) * Math.exp(-0.5 * Math.pow((x - popMean) / se, 2));
return grid.map(x => ({x: x, y: pdf(x)}));
}
// 6. SIDE-BY-SIDE PLOTS WITH EXTERNAL LEGEND
{
// --- A. SETUP VARIABLES ---
const popMean = d3.mean(data, d => d.caffeine_mg);
const n = sample_means.length;
const bins = d3.bin().thresholds(20)(sample_means);
// Fixed x-axis limits so the axes stay constant regardless of sample size.
// Population plot: the observed range of the data.
const xLimits = d3.extent(data, d => d.caffeine_mg);
// Sampling-distribution plot: a hard-coded range in mg.
const xLimits2 = [0, 750];
// Define colors
const colorDomain = ["Sampling Distribution Mean", "Population Mean", "Normal Curve"];
const colorRange = ["black", "#E69F00", "#D55E00"];
// --- B. POPULATION PLOT (Left) ---
const popPlot = Plot.plot({
width: 550,
height: 500,
caption: "Population Distribution (Raw Data)",
marginTop: 40,
marginLeft: 50,
marginBottom: 40,
style: { fontSize: "14px", overflow: "visible" },
marks: [
Plot.rectY(data, Plot.binX({y: "count"}, {x: "caffeine_mg", fill: "#444444", fillOpacity: 0.5})),
],
// We explicitly set the domain here too, just to be safe and consistent
x: { label: "Caffeine consumption (mg)", domain: xLimits },
y: { label: "Count" }
});
// --- C. SAMPLING DISTRIBUTION PLOT (Right) ---
const sampPlot = Plot.plot({
width: 550,
height: 500,
caption: `Distribution of sample means (n=${sample_size})`,
marginTop: 40,
marginLeft: 50,
marginBottom: 40,
style: { fontSize: "14px", overflow: "visible" },
color: {
domain: colorDomain,
range: colorRange,
legend: false
},
marks: [
Plot.rectY(bins, {
x1: "x0", x2: "x1",
y: d => d.length / (n * (d.x1 - d.x0)),
fill: "#444444", fillOpacity: 0.3, tip: true,
title: d => `Count: ${d.length}\nDensity: ${(d.length / (n * (d.x1 - d.x0))).toFixed(3)}`
}),
Plot.line(normalCurve, {x: "x", y: "y", stroke: () => "Normal Curve", strokeWidth: 3}),
Plot.ruleX([d3.mean(sample_means)], {stroke: () => "Sampling Distribution Mean", strokeWidth: 3}),
Plot.ruleX([popMean], {stroke: () => "Population Mean", strokeDasharray: "4,4", strokeWidth: 2, opacity: 0.8})
],
// Apply the fixed [0, 750] domain defined above
x: { label: "Sample Mean", tickFormat: ".1f", domain: xLimits2 },
y: { label: "Density" }
});
// --- D. LEGEND & LAYOUT ---
const myLegend = Plot.legend({
color: { domain: colorDomain, range: colorRange },
style: { marginBottom: "10px", fontSize: "14px" }
});
return html`
<div style="display: flex; flex-direction: column; align-items: center;">
<div style="width: 100%; max-width: 920px; display: flex; justify-content: flex-end;">
${myLegend}
</div>
<div style="display: flex; flex-wrap: wrap; gap: 20px; justify-content: center;">
<div>${popPlot}</div>
<div>${sampPlot}</div>
</div>
</div>`;
}
Central Limit Theorem: answers this question
As the sample size \(n\) increases, the distribution of the sample mean \(\bar{X}_n\) is increasingly well approximated by a Normal distribution with mean \(\mu\) and variance \(\frac{\sigma^2}{n}\):
\[\bar{X}_n \overset{\cdot}{\sim} N\left(\mu, \frac{\sigma^2}{n}\right)\]
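The \(\sigma^2/n\) term can be checked numerically. The sketch below is plain JavaScript rather than an Observable cell, and it does not use the course datasets: it assumes a hypothetical right-skewed population (exponential with mean 100 and standard deviation 100), draws i.i.d. samples, and compares the standard deviation of 4,000 simulated sample means with \(\sigma/\sqrt{n}\). The seeded generator `mulberry32` and the helper names are illustrative choices, used only to make the run repeatable.

```javascript
// Small seeded pseudo-random generator (mulberry32) so results are repeatable.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const rand = mulberry32(2025);
const SIGMA = 100; // SD of the assumed exponential population (its mean is also 100)

// One draw from the assumed population: exponential with mean 100.
const drawOne = () => -100 * Math.log(1 - rand());

// Mean of one i.i.d. sample of size n.
function sampleMean(n) {
  let sum = 0;
  for (let i = 0; i < n; i++) sum += drawOne();
  return sum / n;
}

// Standard deviation (n - 1 denominator) of an array of numbers.
function sd(values) {
  const m = values.reduce((s, v) => s + v, 0) / values.length;
  const ss = values.reduce((s, v) => s + (v - m) ** 2, 0);
  return Math.sqrt(ss / (values.length - 1));
}

// For each sample size, simulate 4000 sample means and compare the spreads.
const results = [2, 25, 100].map(n => {
  const means = Array.from({ length: 4000 }, () => sampleMean(n));
  return { n, simulatedSD: sd(means), predictedSD: SIGMA / Math.sqrt(n) };
});

results.forEach(r =>
  console.log(`n=${r.n}: simulated ${r.simulatedSD.toFixed(1)}, predicted ${r.predictedSD.toFixed(1)}`)
);
```

The predicted values are 70.7, 20.0 and 10.0. The simulated values should land close to them, with the remaining gap coming from using a finite number of simulated samples.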
data2 = FileAttachment("work_df.csv").csv({ typed: true })
sample_values2 = [2, 10, 25, 50, 75, 100, 200, 500]
viewof sample_index2 = Inputs.range(
[0, sample_values2.length - 1], // Slide from index 0 to 7
{
value: 0, // Start at index 0 (sample size 2)
step: 1,
label: "Sample size:",
// This makes the label show "50" instead of "3"
format: i => sample_values2[i]
}
)
sample_size2 = sample_values2[sample_index2]
sample_means2 = {
// Dependency tracking: re-run when these change
sample_size2;
// Safety check: ensure data is loaded
if (!data2 || data2.length === 0) return [];
// Safety check: ensure column exists (Case Sensitive!)
if (typeof data2[0].hrs_worked === "undefined") {
return ["Error: Column 'hrs_worked' not found in data"];
}
const means = [];
// Copy the population once. d3.shuffle reorders its argument in place,
// so each iteration below shuffles a fresh copy and every sample is
// drawn from the full population.
const popArray = Array.from(data2);
for (let i = 0; i < 500; i++) {
// 1. Shuffle a clean copy of the population
// 2. Slice the first n elements (Sampling WITHOUT replacement)
const sample = d3.shuffle([...popArray]).slice(0, sample_size2);
// 3. Compute mean
const m = d3.mean(sample, d => d.hrs_worked);
means.push(m);
}
return means;
}
// 5. CALCULATE NORMAL CURVE (THEORETICAL)
// We now generate the curve based on the Population Parameters, not the sample.
// This shows the Theoretical prediction of the CLT.
normalCurve2 = {
if (sample_means2.length < 2 || !data2) return [];
// 1. Get Population Statistics
const popMean = d3.mean(data2, d => d.hrs_worked);
const popSD = d3.deviation(data2, d => d.hrs_worked);
// 2. Calculate Standard Error (The predicted width of the sampling distribution)
// Formula: Population SD / sqrt(n)
const se = popSD / Math.sqrt(sample_size2);
// 3. Determine range for the curve (centered on popMean)
const minX = popMean - (4 * se);
const maxX = popMean + (4 * se);
// Create a smooth grid of X values
const grid = d3.range(minX, maxX, (maxX - minX) / 200);
// Normal Density Function using Population Mean and Standard Error
const pdf = (x) => (1 / (se * Math.sqrt(2 * Math.PI))) * Math.exp(-0.5 * Math.pow((x - popMean) / se, 2));
return grid.map(x => ({x: x, y: pdf(x)}));
}
// 6. SIDE-BY-SIDE PLOTS WITH EXTERNAL LEGEND
{
// --- A. SETUP VARIABLES ---
const popMean = d3.mean(data2, d => d.hrs_worked);
const n = sample_means2.length;
const bins = d3.bin().thresholds(20)(sample_means2);
// Calculate fixed x-axis limits based on the population data
// This ensures the axis stays constant regardless of sample size
const xLimits = d3.extent(data2, d => d.hrs_worked);
// Define colors
const colorDomain = ["Sampling Distribution Mean", "Population Mean", "Normal Curve"];
const colorRange = ["black", "#E69F00", "#D55E00"];
// --- B. POPULATION PLOT (Left) ---
const popPlot = Plot.plot({
width: 550,
height: 500,
caption: "Population Distribution (Raw Data)",
marginTop: 40,
marginLeft: 50,
marginBottom: 40,
style: { fontSize: "14px", overflow: "visible" },
marks: [
Plot.rectY(data2, Plot.binX({y: "count"}, {x: "hrs_worked", fill: "#444444", fillOpacity: 0.5})),
],
// We explicitly set the domain here too, just to be safe and consistent
x: { label: "Hours worked", domain: xLimits },
y: { label: "Count" }
});
// --- C. SAMPLING DISTRIBUTION PLOT (Right) ---
const sampPlot = Plot.plot({
width: 550,
height: 500,
caption: `Sampling Distribution (n=${sample_size2})`,
marginTop: 40,
marginLeft: 50,
marginBottom: 40,
style: { fontSize: "14px", overflow: "visible" },
color: {
domain: colorDomain,
range: colorRange,
legend: false
},
marks: [
Plot.rectY(bins, {
x1: "x0", x2: "x1",
y: d => d.length / (n * (d.x1 - d.x0)),
fill: "#444444", fillOpacity: 0.3, tip: true,
title: d => `Count: ${d.length}\nDensity: ${(d.length / (n * (d.x1 - d.x0))).toFixed(3)}`
}),
Plot.line(normalCurve2, {x: "x", y: "y", stroke: () => "Normal Curve", strokeWidth: 3}),
Plot.ruleX([d3.mean(sample_means2)], {stroke: () => "Sampling Distribution Mean", strokeWidth: 3}),
Plot.ruleX([popMean], {stroke: () => "Population Mean", strokeDasharray: "4,4", strokeWidth: 2, opacity: 0.8})
],
// Apply the fixed population domain here
x: { label: "Sample Mean", tickFormat: ".1f", domain: xLimits },
y: { label: "Density" }
});
// --- D. LEGEND & LAYOUT ---
const myLegend = Plot.legend({
color: { domain: colorDomain, range: colorRange },
style: { marginBottom: "10px", fontSize: "14px" }
});
return html`
<div style="display: flex; flex-direction: column; align-items: center;">
<div style="width: 100%; max-width: 920px; display: flex; justify-content: flex-end;">
${myLegend}
</div>
<div style="display: flex; flex-wrap: wrap; gap: 20px; justify-content: center;">
<div>${popPlot}</div>
<div>${sampPlot}</div>
</div>
</div>`;
}
stock = FileAttachment("stocks.csv").csv({ typed: true })
sample_values3 = [2, 10, 25, 50, 75, 100, 200, 500]
viewof sample_index3 = Inputs.range(
[0, sample_values3.length - 1], // Slide from index 0 to 7
{
value: 0, // Start at index 0 (sample size 2)
step: 1,
label: "Sample size:",
// This makes the label show "50" instead of "3"
format: i => sample_values3[i]
}
)
sample_size_st = sample_values3[sample_index3]
sample_means_stock = {
// Dependency tracking: re-run when these change
sample_size_st;
// Safety check: ensure the stock data is loaded
if (!stock || stock.length === 0) return [];
// Safety check: ensure column exists (Case Sensitive!)
if (typeof stock[0].sp500_volume === "undefined") {
return ["Error: Column 'sp500_volume' not found in stock"];
}
const means = [];
// Copy the population once. d3.shuffle reorders its argument in place,
// so each iteration below shuffles a fresh copy and every sample is
// drawn from the full population.
const popArray = Array.from(stock);
for (let i = 0; i < 500; i++) {
// 1. Shuffle a clean copy of the population
// 2. Slice the first n elements (Sampling WITHOUT replacement)
const sample = d3.shuffle([...popArray]).slice(0, sample_size_st);
// 3. Compute mean
const m = d3.mean(sample, d => d.sp500_volume);
means.push(m);
}
return means;
}
// 5. CALCULATE NORMAL CURVE (THEORETICAL)
// We now generate the curve based on the Population Parameters, not the sample.
// This shows the Theoretical prediction of the CLT.
normalCurve_stock = {
if (sample_means_stock.length < 2 || !stock) return [];
// 1. Get Population Statistics
const popMean = d3.mean(stock, d => d.sp500_volume);
const popSD = d3.deviation(stock, d => d.sp500_volume);
// 2. Calculate Standard Error (The predicted width of the sampling distribution)
// Formula: Population SD / sqrt(n)
const se = popSD / Math.sqrt(sample_size_st);
// 3. Determine range for the curve (centered on popMean)
const minX = popMean - (4 * se);
const maxX = popMean + (4 * se);
// Create a smooth grid of X values
const grid = d3.range(minX, maxX, (maxX - minX) / 200);
// Normal Density Function using Population Mean and Standard Error
const pdf = (x) => (1 / (se * Math.sqrt(2 * Math.PI))) * Math.exp(-0.5 * Math.pow((x - popMean) / se, 2));
return grid.map(x => ({x: x, y: pdf(x)}));
}
// 6. SIDE-BY-SIDE PLOTS WITH EXTERNAL LEGEND
{
// --- A. SETUP VARIABLES ---
const popMean = d3.mean(stock, d => d.sp500_volume);
const n = sample_means_stock.length;
const bins = d3.bin().thresholds(20)(sample_means_stock);
// Calculate fixed x-axis limits based on the population data
// This ensures the axis stays constant regardless of sample size
const xLimits = d3.extent(stock, d => d.sp500_volume);
// Define colors
const colorDomain = ["Sampling Distribution Mean", "Population Mean", "Normal Curve"];
const colorRange = ["black", "#E69F00", "#D55E00"];
// --- B. POPULATION PLOT (Left) ---
const popPlot = Plot.plot({
width: 550,
height: 500,
caption: "Population Distribution (Raw Data)",
marginTop: 40,
marginLeft: 50,
marginBottom: 40,
style: { fontSize: "14px", overflow: "visible" },
marks: [
Plot.rectY(stock, Plot.binX({y: "count"}, {x: "sp500_volume", fill: "#444444", fillOpacity: 0.5})),
],
// We explicitly set the domain here too, just to be safe and consistent
x: { label: "S&P 500 Volume (x10^6)", domain: xLimits },
y: { label: "Count" }
});
// --- C. SAMPLING DISTRIBUTION PLOT (Right) ---
const sampPlot = Plot.plot({
width: 550,
height: 500,
caption: `Sampling Distribution (n=${sample_size_st})`,
marginTop: 40,
marginLeft: 50,
marginBottom: 40,
style: { fontSize: "14px", overflow: "visible" },
color: {
domain: colorDomain,
range: colorRange,
legend: false
},
marks: [
Plot.rectY(bins, {
x1: "x0", x2: "x1",
y: d => d.length / (n * (d.x1 - d.x0)),
fill: "#444444", fillOpacity: 0.3, tip: true,
title: d => `Count: ${d.length}\nDensity: ${(d.length / (n * (d.x1 - d.x0))).toFixed(3)}`
}),
Plot.line(normalCurve_stock, {x: "x", y: "y", stroke: () => "Normal Curve", strokeWidth: 3}),
Plot.ruleX([d3.mean(sample_means_stock)], {stroke: () => "Sampling Distribution Mean", strokeWidth: 3}),
Plot.ruleX([popMean], {stroke: () => "Population Mean", strokeDasharray: "4,4", strokeWidth: 2, opacity: 0.8})
],
// Apply the fixed population domain here
x: { label: "Sample Mean", tickFormat: ".1f", domain: xLimits },
y: { label: "Density" }
});
// --- D. LEGEND & LAYOUT ---
const myLegend = Plot.legend({
color: { domain: colorDomain, range: colorRange },
style: { marginBottom: "10px", fontSize: "14px" }
});
return html`
<div style="display: flex; flex-direction: column; align-items: center;">
<div style="width: 100%; max-width: 920px; display: flex; justify-content: flex-end;">
${myLegend}
</div>
<div style="display: flex; flex-wrap: wrap; gap: 20px; justify-content: center;">
<div>${popPlot}</div>
<div>${sampPlot}</div>
</div>
</div>`;
}
Other helpful interactive demonstrations: Stats calculators, Seeing theory
De Moivre and (later) Laplace connected the outcomes of coin flips to the Normal distribution
Quincunx Machine (Galton Board)
Play more with the simulation here
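The coin-flip result is also what the Galton board displays: if each ball is assumed to bounce left or right with probability 1/2 at every row of pegs, its final bin is a Binomial(rows, 1/2) count. The sketch below (plain JavaScript; the function names and the choice of 100 rows are illustrative assumptions) computes the exact binomial probabilities and compares them with the Normal density that has the same mean \(n/2\) and variance \(n/4\).

```javascript
// Exact Binomial(n, 1/2) probabilities P(K = k) for k = 0..n,
// built iteratively from P(K = 0) = (1/2)^n.
function fairBinomialPmf(n) {
  const pmf = [Math.pow(0.5, n)];
  for (let k = 0; k < n; k++) {
    pmf.push(pmf[k] * (n - k) / (k + 1));
  }
  return pmf;
}

// Normal density with the given mean and standard deviation.
function normalPdf(x, mean, sdev) {
  const z = (x - mean) / sdev;
  return Math.exp(-0.5 * z * z) / (sdev * Math.sqrt(2 * Math.PI));
}

const rows = 100;                 // assumed number of peg rows (or coin flips)
const pmf = fairBinomialPmf(rows);
const mean = rows / 2;            // n * p
const sdev = Math.sqrt(rows) / 2; // sqrt(n * p * (1 - p))

// Largest gap between the exact probability and the Normal density.
const maxGap = Math.max(...pmf.map((p, k) => Math.abs(p - normalPdf(k, mean, sdev))));

console.log(`P(K = 50) exact: ${pmf[50].toFixed(4)}`);
console.log(`Normal density at 50: ${normalPdf(50, mean, sdev).toFixed(4)}`);
console.log(`largest gap over all bins: ${maxGap.toExponential(1)}`);
```

For 100 rows the central bin has exact probability of about 0.0796 against a Normal density of about 0.0798, so the bell curve drawn over a Galton board is already a close match at this size.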
Let \(X_1, X_2, \dots, X_n\) be a sequence of independent and identically distributed (i.i.d.) random variables with: \[E[X_i] = \mu \quad \text{and} \quad 0 < \text{Var}(X_i) = \sigma^2 < \infty\]
Let \(\bar{X}_n\) be the sample mean: \(\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i\)
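As the sample size \(n\) approaches infinity, the centered and scaled sample mean converges in distribution to a Normal distribution:
\[\sqrt{n}\,(\bar{X}_n - \mu) \xrightarrow{d} N\left(0, \sigma^2\right)\]
Equivalently, for large \(n\), \(\bar{X}_n\) is approximately \(N\left(\mu, \frac{\sigma^2}{n}\right)\).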
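The mean \(\mu\) and variance \(\frac{\sigma^2}{n}\) of the sample mean follow directly from the i.i.d. assumptions, before any appeal to Normality. A short derivation, using only linearity of expectation and the fact that variances of independent random variables add:
\[E[\bar{X}_n] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n}\cdot n\mu = \mu\]
\[\text{Var}(\bar{X}_n) = \frac{1}{n^2}\sum_{i=1}^{n} \text{Var}(X_i) = \frac{1}{n^2}\cdot n\sigma^2 = \frac{\sigma^2}{n}\]
The standard deviation of the sample mean is therefore \(\frac{\sigma}{\sqrt{n}}\), so quadrupling the sample size halves it.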
tinyurl.com/koffclt25