Review: Population, sample, and sampling distributions

Page 1: Review: Population, sample, and sampling distributions

Cohen Empirical Methods CS650

Review: Population, sample, and sampling distributions

[Figure: a population distribution with axis marks at 0 and 1, from which repeated samples are drawn]

A population with mean µ and standard deviation σ

For instance, µ = 0, σ = 1

Sample 1, N=30: InterquartileRange = 1.25

Sample 2, N=30: InterquartileRange = 1.7

Sample 100000000000: InterquartileRange = 0.65

The sampling distribution of the interquartile range for samples of size N = 30
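
The idea on this slide can be tried out with a short simulation. The sketch below is in Python rather than the Lisp used later in the slides, purely for convenience. It draws many samples of size N = 30 from a normal population with µ = 0 and σ = 1 and records the interquartile range of each one. The function names, the number of samples, and the quartile convention of `statistics.quantiles` are assumptions made for illustration, not part of the original slides.

```python
import random
import statistics


def interquartile_range(sample):
    """Difference between the third and first quartiles of a sample."""
    q1, _, q3 = statistics.quantiles(sample, n=4)
    return q3 - q1


def iqr_sampling_distribution(n, mu, sigma, k, seed=0):
    """Draw k samples of size n from Normal(mu, sigma); return each sample's IQR."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    return [
        interquartile_range([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(k)
    ]


if __name__ == "__main__":
    iqrs = iqr_sampling_distribution(n=30, mu=0.0, sigma=1.0, k=10_000)
    print("mean of the IQRs:", round(statistics.mean(iqrs), 2))
    print("std of the IQRs: ", round(statistics.stdev(iqrs), 2))
```

A histogram of the returned list approximates the sampling distribution of the interquartile range for N = 30.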

Page 2: Review: Population, sample, and sampling distributions

How’s your IQ?

• Suppose the population IQ is normally distributed with mean 100 and standard deviation 20.

• The mean IQ in this class of 23 students is 130.

• Should we reject the null hypothesis that this class is no different in IQ from the population?

Page 3: Review: Population, sample, and sampling distributions

The Logic

• Our result: R = 130

• Assume Ho: Π = 100, this class is a random sample drawn from the population of people with mean IQ 100

• If the result is very unlikely under Ho, if Pr(R=130 | Π = 100) ≤ α, then we are inclined to reject Ho.

• Pick a value of α (say, .01) and calculate the conditional probability p = Pr(R=130 | Π = 100)

• Our residual uncertainty that Ho might be right is less than or equal to α

Page 4: Review: Population, sample, and sampling distributions

Calculate p = Pr(R=130 | Π = 100)

Find the sampling distribution of R for N = 23

• Since we know the population parameters (normal, mean = 100, standard deviation = 20) we can get the sampling distribution by Monte Carlo sampling:

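[Figure: histogram of the simulated sampling distribution of the mean; x-axis ticks at 90, 95, 100, 105, 110; y-axis ticks at 10, 20, 30]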

;; MEAN and SAMPLE-NORMAL-TO-LIST are helper functions assumed to be defined elsewhere.
(defun sampling-distribution (n mean std k)
  "N is the sample size, MEAN and STD are the parameters
of a normal distribution, K is the size (number of samples)
of the sampling distribution."
  (loop repeat k
        collect (mean (sample-normal-to-list mean std n))))
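
For readers who do not have the Lisp helpers at hand, here is a rough Python analogue of the same Monte Carlo procedure. It is a sketch under assumptions: the `seed` parameter and the use of `random.gauss` in place of `sample-normal-to-list` are choices made here, not taken from the slides.

```python
import random
import statistics


def sampling_distribution(n, mean, std, k, seed=0):
    """Return k sample means, each from a sample of size n drawn from Normal(mean, std)."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    return [
        statistics.fmean([rng.gauss(mean, std) for _ in range(n)])
        for _ in range(k)
    ]


if __name__ == "__main__":
    means = sampling_distribution(n=23, mean=100, std=20, k=10_000)
    print("mean of sample means:", round(statistics.fmean(means), 2))
    print("std of sample means: ", round(statistics.stdev(means), 2))
```

With the IQ example's parameters, the simulated means should cluster near 100 with a spread a little over 4.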

Page 5: Review: Population, sample, and sampling distributions

Calculate p = Pr(R=130 | Π = 100)

Find the sampling distribution of R for N = 23

• Since we know the population parameters (normal, mean = 100, standard deviation = 20) we can get the sampling distribution by Monte Carlo sampling:

• The probability of getting a sample of size 23 with mean 130 by random sampling from a population with mean 100 and standard deviation 20 is virtually zero.

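[Figure: histogram of the simulated sampling distribution of the mean; x-axis ticks at 90, 95, 100, 105, 110, and 130; y-axis ticks at 10, 20, 30]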

Page 6: Review: Population, sample, and sampling distributions

Another way to write the code:

;; Count how many of K simulated sample means exceed the observed result R.
(defun sampling-distribution (n mean std r k)
  (loop repeat k
        counting (> (mean (sample-normal-to-list mean std n)) r)))

(sampling-distribution 23 100 20 130 1000) => 0
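
The counting version can be sketched in Python as well. The function name `count_exceeding` and the `seed` argument are hypothetical; the logic mirrors the Lisp above by counting how many of k simulated sample means exceed the observed result r. Dividing the count by k gives a Monte Carlo estimate of the tail probability.

```python
import random
import statistics


def count_exceeding(n, mean, std, r, k, seed=0):
    """Count how many of k simulated sample means (sample size n) exceed r."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    count = 0
    for _ in range(k):
        sample_mean = statistics.fmean([rng.gauss(mean, std) for _ in range(n)])
        if sample_mean > r:
            count += 1
    return count


if __name__ == "__main__":
    k = 1000
    print("means above 130:", count_exceeding(23, 100, 20, 130, k))
    print("estimated p for 108:", count_exceeding(23, 100, 20, 108, k) / k)
```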

Page 7: Review: Population, sample, and sampling distributions

Parametric statistical inference

• Testing hypotheses by simulating the process of sampling is cool but not always necessary

• The probability of tossing 15 heads in 20 with a fair coin can be worked out exactly

• The probability that a sample from a population has a particular mean can be estimated

• However, theory tells us about the sampling distributions of very few statistics; for the rest, simulation works great
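
The coin example can be worked out exactly from the binomial distribution. The Python sketch below uses hypothetical function names and shows two readings of the claim: exactly 15 heads in 20 tosses, and 15 or more heads.

```python
from math import comb


def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_upper_tail(k, n, p=0.5):
    """Probability of k or more successes in n independent trials."""
    return sum(binomial_pmf(i, n, p) for i in range(k, n + 1))


if __name__ == "__main__":
    print("Pr(exactly 15 heads in 20):", round(binomial_pmf(15, 20), 4))
    print("Pr(15 or more heads in 20):", round(binomial_upper_tail(15, 20), 4))
```

For a fair coin these are 15504 / 2^20 and 21700 / 2^20, roughly .015 and .021.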

Page 8: Review: Population, sample, and sampling distributions

Central Limit Theorem

• The sampling distribution of the mean of samples of size N drawn from a population with mean µ and standard deviation σ approaches a normal distribution with mean µ and standard deviation σ / √N as N becomes large

• Good news! We know the sampling distribution of the mean and can estimate the probability of sample results!

Page 9: Review: Population, sample, and sampling distributions

The Logic

• Our result: R = 130

• Assume Ho: Π = 100, this class is a random sample drawn from the population of people with mean IQ 100

• If the result is very unlikely under Ho, if Pr(R=130 | Π = 100) ≤ α, then we are inclined to reject Ho.

• Pick a value of α (say, .01) and calculate the conditional probability p = Pr(R=130 | Π = 100)

• The sampling distribution of the mean approaches a normal distribution with mean = 100 and std = 20 / √23 = 4.17

• So our sample result is 30 / 4.17 = 7.2 standard deviations above the mean of the sampling distribution!

Page 10: Review: Population, sample, and sampling distributions

Standard error: The standard deviation of the sampling distribution

[Figure: the sampling distribution under Ho; x-axis marks at 100, 104, and 130]

Standard Error of the Mean: under Ho: Π = 100, the sampling distribution is normal, its mean is 100, and its standard deviation is 20 / √23 = 4.17

The standard error is 4.17

The sample result is 30 / 4.17 = 7.2 standard error units above the mean under Ho

About 95% of a normal distribution lies within two standard deviations of the mean, and about 99.7% lies within three. How probable is our sample result?

Page 11: Review: Population, sample, and sampling distributions

Try it again with a less extreme result

• Our result: R = 108

• Assume Ho: Π = 100, this class is a random sample drawn from the population of people with mean IQ 100

• If the result is very unlikely under Ho, if Pr(R=108 | Π = 100) ≤ α, then we are inclined to reject Ho.

• Pick a value of α (say, .01) and calculate the conditional probability p = Pr(R=108 | Π = 100)

• The sampling distribution of the mean approaches a normal distribution with mean = 100 and std = 20 / √23 = 4.17

• So our sample result is 8 / 4.17 = 1.92 standard errors above the mean of the sampling distribution.

Page 12: Review: Population, sample, and sampling distributions

p values

[Figure: the sampling distribution under Ho; x-axis marks at 100, 104 (labeled s.e.), and 108]

Under Ho: Π = 100, the sampling distribution is normal, its mean is 100, and its standard deviation is 20 / √23 = 4.17

The sample result, R=108, is 1.92 standard error units above the mean under Ho.

Now it isn’t so obvious that we should reject Ho.

How can we find p = Pr(R=108 | Π = 100) ?

State the result in standard error units and look up its probability in a table.

Page 13: Review: Population, sample, and sampling distributions

p values

[Figure: the sampling distribution under Ho; x-axis marks at 100, 104 (labeled s.e.), and 108]

The sample result, R=108, is 1.92 standard error units above the mean under Ho.

Page 14: Review: Population, sample, and sampling distributions

Standardizing – subtract the mean, divide by the standard error

[Figure: the sampling distribution in original units; x-axis marks at 100, 104 (labeled s.e.), and 108]

Under Ho: Π = 100, the sampling distribution is normal, its mean is 100, its standard deviation is 20 / √23 = 4.17, and the sample result is 108

[Figure: the standardized sampling distribution; x-axis marks at 0, 1 (labeled s.e.), and 1.92]

Under Ho: Π = 0, the sampling distribution is normal, its mean is 0, its standard deviation is 1.0, and the sample result is (108 - 100) / (20 / √23) = 1.92

Page 15: Review: Population, sample, and sampling distributions

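Z scores or standard scores – subtract the mean, divide by the standard error

[Figure: the sampling distribution before standardizing (x-axis marks at 100, 104, 108) and after standardizing (x-axis marks at 0, 1, 1.92)]

Z = (x – µ) / s.e. = (x – µ) / (σ / √N)

(108 - 100) / 4.17 = 1.92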

Page 16: Review: Population, sample, and sampling distributions

The Z test

Z is the number of standard error units the sample mean is from the mean of the sampling distribution under the null hypothesis.

If Z ≥ 1.645 then the sample result has p ≤ .05 probability given the null hypothesis (one-tailed)

If Z ≥ 2.33 then the sample result has p ≤ .01 probability given the null hypothesis (one-tailed); Z ≥ 1.96 corresponds to p ≤ .025

Z = (x – µ) / s.e.

1.92 = (108 – 100) / (20 / √23)
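
Critical values of Z can be computed rather than read from a table. The Python sketch below uses `statistics.NormalDist` from the standard library; the function name `critical_z` is hypothetical, and the values are one-tailed (upper tail) thresholds of the standard normal distribution.

```python
from statistics import NormalDist


def critical_z(alpha):
    """One-tailed critical value z such that Pr(Z >= z) = alpha for a standard normal Z."""
    return NormalDist().inv_cdf(1 - alpha)


if __name__ == "__main__":
    for alpha in (0.05, 0.025, 0.01):
        print(f"alpha = {alpha}: Z >= {critical_z(alpha):.3f}")
```

This gives roughly 1.645 for α = .05, 1.960 for α = .025, and 2.326 for α = .01.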

Page 17: Review: Population, sample, and sampling distributions

The Z test

• Our result: R = 108

• Assume Ho: Π = 100, this class is a random sample drawn from the population of people with mean IQ 100

• If the result is very unlikely under Ho, if Pr(R=108 | Π = 100) ≤ α, then we are inclined to reject Ho.

• Pick a value of α (say, .01) and calculate the conditional probability p = Pr(R=108 | Π = 100)

• The sampling distribution of the mean approaches a normal distribution with mean = 100 and std = 20 / √23 = 4.17

• So our sample result is 8 / 4.17 = 1.92 standard errors above the mean of the sampling distribution

• Equivalently, Z = (108 - 100) / 4.17 = 1.92

• p = Pr(R=108 | Π = 100) = Pr(Z ≥ 1.92) = .0274

• α = .01; since p > α, do not reject Ho.
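
The steps of the Z test can be collected into a few lines of Python. This is a sketch with a hypothetical function name, assuming a one-tailed (upper tail) test and a known population standard deviation, as in the IQ example.

```python
from math import sqrt
from statistics import NormalDist


def z_test_upper(sample_mean, n, mu, sigma):
    """Return (Z, p) for an upper-tail Z test of a sample mean against mean mu."""
    standard_error = sigma / sqrt(n)       # std of the sampling distribution
    z = (sample_mean - mu) / standard_error
    p = 1 - NormalDist().cdf(z)            # Pr(Z >= z) under Ho
    return z, p


if __name__ == "__main__":
    alpha = 0.01
    z, p = z_test_upper(sample_mean=108, n=23, mu=100, sigma=20)
    print(f"Z = {z:.2f}, p = {p:.4f}")
    print("reject Ho" if p <= alpha else "do not reject Ho")
```

The computed p differs slightly in the fourth decimal from the table value for Z = 1.92 because Z is not rounded before the lookup.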

Page 18: Review: Population, sample, and sampling distributions

You do it:

• A sample of size 25 has mean 8. Test the hypothesis that the sample is drawn from a population with mean 12, standard deviation 10.

Page 19: Review: Population, sample, and sampling distributions

You do it:

• A sample of size 25 has mean 8. Test the hypothesis that the sample is drawn from a population with mean 12, standard deviation 10.

Z = (8 - 12) / (10 / √25) = –2
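
The slide stops at Z = –2. As one possible way to finish the exercise, the Python sketch below converts that Z score into a lower-tail probability and a two-tailed probability. Which of the two applies depends on the alternative hypothesis, which the exercise does not state, so both are shown; the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_score(sample_mean, n, mu, sigma):
    """Number of standard errors between the sample mean and mu."""
    return (sample_mean - mu) / (sigma / sqrt(n))


def tail_probabilities(z):
    """Return (lower-tail p, two-tailed p) for a standard normal Z score."""
    lower = NormalDist().cdf(z)
    two_tailed = 2 * NormalDist().cdf(-abs(z))
    return lower, two_tailed


if __name__ == "__main__":
    z = z_score(sample_mean=8, n=25, mu=12, sigma=10)
    lower, two_tailed = tail_probabilities(z)
    print(f"Z = {z:.1f}, lower-tail p = {lower:.4f}, two-tailed p = {two_tailed:.4f}")
```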

Page 20: Review: Population, sample, and sampling distributions

Central limit theorem demo

[Figure: histogram of VAR (Dataset-3), labeled "Population", with x-axis from -10 to 1, followed by three histograms of the sampling distribution of the mean for N = 20, 30, and 50]

N=20: Std = .25

N=30: Std = .21

N=50: Std = .16

(loop repeat 1000 collect (mean (sample-from-population n)))

Std(population) = 1.11

s.e.(20) = 1.11 / √ 20 = .248

s.e.(30) = 1.11 / √ 30 = .203

s.e.(50) = 1.11 / √ 50 = .157
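
A demo along these lines can be reproduced in Python. The slide's population (Dataset-3) is not available, so the sketch below substitutes a hypothetical left-skewed population, the negative of an exponential variable with standard deviation 1. It then compares the empirical standard deviation of 1000 sample means with the predicted standard error σ / √N.

```python
import random
import statistics
from math import sqrt

POPULATION_STD = 1.0  # std of the assumed stand-in population below


def sample_from_population(n, rng):
    """Draw n values from a left-skewed stand-in population: -Exponential(rate 1)."""
    return [-rng.expovariate(1.0) for _ in range(n)]


def std_of_sample_means(n, k=1000, seed=0):
    """Empirical std of k sample means, each from a sample of size n."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    means = [statistics.fmean(sample_from_population(n, rng)) for _ in range(k)]
    return statistics.stdev(means)


if __name__ == "__main__":
    for n in (20, 30, 50):
        predicted = POPULATION_STD / sqrt(n)
        print(f"N={n}: empirical {std_of_sample_means(n):.3f}, predicted {predicted:.3f}")
```

Even though the stand-in population is far from normal, the empirical and predicted values should agree closely.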

Page 21: Review: Population, sample, and sampling distributions

Three components of all test statistics

Z = (x – µ) / s.e. = (x – µ) / (σ / √N)

Effect size: x – µ

Background variance: σ

Sample size: N

You can make any Z score significant with a big enough sample, but you shouldn’t. Always try to control variance before increasing N.
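
To see how the three components interact, the Python sketch below holds a small, hypothetical effect fixed (a sample mean of 101 against a population mean of 100) and varies the sample size and the background standard deviation. The specific numbers are illustrative assumptions, not values from the slides.

```python
from math import sqrt


def z_score(sample_mean, mu, sigma, n):
    """Z = effect size / (background standard deviation / sqrt(sample size))."""
    return (sample_mean - mu) / (sigma / sqrt(n))


if __name__ == "__main__":
    # Same one-point effect, growing sample size: Z grows like sqrt(N).
    for n in (23, 100, 1000, 10000):
        print(f"sigma=20, N={n}: Z = {z_score(101, 100, 20, n):.2f}")
    # Same one-point effect, same N, less background variability.
    print(f"sigma=5,  N=23: Z = {z_score(101, 100, 5, 23):.2f}")
```

Quadrupling N only doubles Z, while halving σ has the same effect without collecting any more data.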

Page 22: Review: Population, sample, and sampling distributions

Parametric and computer-intensive hypothesis testing

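[Figure: the theoretical sampling distribution under Ho; x-axis marks at 100, 104 (labeled std), and 130]

Under Ho: Π = 100, the mean of the sampling distribution is 100 and the standard deviation is 20 / √23 = 4.17

[Figure: histogram of the simulated sampling distribution; x-axis ticks at 90, 95, 100, 105, 110, and 130; y-axis ticks at 10, 20, 30]

Empirically (by simulation) this distribution has a mean of 100.05 and a standard deviation of 4.38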

Page 23: Review: Population, sample, and sampling distributions

We do not know the sampling distribution of most statistics – but we can estimate them empirically!

;; MEAN and SAMPLE-NORMAL-TO-LIST are helper functions assumed to be defined elsewhere.
(defun sampling-distribution (n mean std k)
  "N is the sample size, MEAN and STD are the parameters
of a normal distribution, K is the size (number of samples)
of the sampling distribution."
  (loop repeat k
        collect (mean (sample-normal-to-list mean std n))))

median interquartile-range trimmed-mean median-divided-by-mom’s-age
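
One way to make the same simulation work for any statistic is to pass the statistic in as a function. The Python sketch below does that for the median and a trimmed mean; the `trimmed_mean` definition (dropping 10% from each end) and the function names are assumptions chosen for illustration.

```python
import random
import statistics


def trimmed_mean(sample, proportion=0.1):
    """Mean after dropping the given proportion of values from each end."""
    data = sorted(sample)
    cut = int(len(data) * proportion)
    return statistics.fmean(data[cut:len(data) - cut])


def sampling_distribution(statistic, n, mean, std, k, seed=0):
    """Apply statistic to each of k samples of size n drawn from Normal(mean, std)."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    return [statistic([rng.gauss(mean, std) for _ in range(n)]) for _ in range(k)]


if __name__ == "__main__":
    for name, stat in (("median", statistics.median), ("trimmed mean", trimmed_mean)):
        dist = sampling_distribution(stat, n=23, mean=100, std=20, k=5000)
        print(f"{name}: mean {statistics.fmean(dist):.2f}, std {statistics.stdev(dist):.2f}")
```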

Page 24: Review: Population, sample, and sampling distributions

Some issues for parametric and computer-intensive tests

• Z is fine if you know σ (recall, z = (x - µ) / (σ / √n)), but what if you don’t? Estimate σ with the sample standard deviation s and, for smaller samples, run t tests.

• Monte Carlo tests are fine if you know the parameters of the population from which samples are drawn, but what if you don’t? Estimate these parameters from the sample and run bootstrap or randomization tests.
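
As a taste of the bootstrap idea, the Python sketch below treats the observed sample as a stand-in for the unknown population and resamples from it with replacement. This is a minimal illustration of estimating a standard error by resampling, not necessarily the procedure the course presents; the function name and the example data (23 simulated IQ scores) are hypothetical.

```python
import random
import statistics


def bootstrap_distribution(sample, statistic, k, seed=0):
    """Apply statistic to k resamples (with replacement) of the observed sample."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    n = len(sample)
    return [statistic(rng.choices(sample, k=n)) for _ in range(k)]


if __name__ == "__main__":
    # Hypothetical observed data: 23 scores, generated here only for the demo.
    data_rng = random.Random(42)
    observed = [data_rng.gauss(100, 20) for _ in range(23)]

    boot_means = bootstrap_distribution(observed, statistics.fmean, k=5000)
    print("sample mean:", round(statistics.fmean(observed), 2))
    print("bootstrap estimate of the standard error:", round(statistics.stdev(boot_means), 2))
```

The same function accepts any statistic, such as `statistics.median`, for which no convenient formula for the standard error is at hand.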