Oftentimes, in experiment planning, there are two competing considerations. We want to collect enough data that we can detect important effects, but collecting data can be expensive, and in experiments involving people, there may be some risk to patients. In this video, we focus on the context of a clinical trial, which is a health-related experiment in which the subjects are people, and we work on determining an appropriate sample size with which we can be 80% sure that we would detect any practically important effect of the drug. In other words, we will find the required sample size that results in a test with 80% power. This 80% might seem arbitrary at first, but it is indeed a commonly required power for many experiments.

Before we delve into the details of calculating the power and the sample size required to attain it, let's quickly review the definition of power and the concepts closely tied to it. When we make a decision in a hypothesis test, one of four things can happen. If the null hypothesis is rejected when it's actually true, we call this a Type 1 error. The probability of a Type 1 error is the significance level of the test, alpha, which is something we get to set at the beginning of the test. If, on the other hand, we fail to reject the null hypothesis when it is indeed true, the right decision is made, and the probability of this happening is the complement of the significance level, 1 minus alpha. If we fail to reject the null hypothesis but the alternative hypothesis is actually true, we call this a Type 2 error, and the probability of making a Type 2 error is beta, which is a little more complicated to calculate. In the last scenario, the null hypothesis is correctly rejected, and the probability of this outcome is called the power of the test. This probability is the complement of the Type 2 error rate, or 1 minus beta. Therefore, keeping the Type 2 error rate low increases the power, which is a desirable outcome. In a hypothesis test we obviously want to keep both error rates, alpha and beta, low. However, for a fixed sample size, decreasing one increases the other, and one solution to this problem is getting a larger sample size. Hence, it's important to think about the sample size when designing an experiment, and to make sure that resources are invested to recruit a sufficiently large number of subjects to obtain the desired power of the test.

Suppose a pharmaceutical company has developed a new drug for lowering blood pressure and is preparing a clinical trial to test the drug's effectiveness. They recruit people who are taking a particular standard blood pressure medication. Half of the subjects are given the new drug; this is the treatment group. The other half continue to take their medication through generic-looking pills to ensure blinding; this is the control group. What are the hypotheses for a two-sided hypothesis test in this context? The null hypothesis states that there is no difference in the average blood pressure of those in the treatment and control groups, and the alternative states that there is indeed a difference. Note that two-sided alternative hypotheses are common in clinical trials, since we would be interested in finding out whether the new drug is better or worse than the existing treatment. Suppose the researchers would like to run this clinical trial on patients with systolic blood pressures between 140 and 180 millimeters of mercury.
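As a quick aside that isn't part of the video itself, the trade-off between the two error rates and the sample size can be sketched with base R's power.t.test() function. The numbers below are illustrative placeholders, not the values from the trial described above.

```r
# Illustrative sketch of the alpha/beta trade-off (placeholder numbers only).
# With the sample size held fixed, demanding a smaller alpha lowers the power,
# i.e. raises beta; a larger sample size buys the lost power back.
power.t.test(n = 50,  delta = 5, sd = 10, sig.level = 0.05)$power  # ~ 0.70
power.t.test(n = 50,  delta = 5, sd = 10, sig.level = 0.01)$power  # lower
power.t.test(n = 100, delta = 5, sd = 10, sig.level = 0.01)$power  # higher again
```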
Suppose previously published studies suggest that the standard deviation of the patients' blood pressures will be about 12 millimeters of mercury and that the distribution of the patients' blood pressures will be approximately symmetric. If we had 100 patients per group, what would be the approximate standard error for the difference in sample means between the treatment and control groups? This is a test for comparing two independent means, so we calculate the standard error as the square root of the sum of the variances of the two groups, that's 12 squared, each divided by its respective sample size, which is 100 for both groups in this case. The standard error comes out to be about 1.70 millimeters of mercury. Then, according to the central limit theorem, the distribution of the differences in sample means will be nearly normal, with mean 0, because remember, that's our null value, and with the standard error we calculated, 1.70.

Using this information, we can find out what values of the sample statistic we would need in order to reject the null hypothesis. Let's start by drawing the null distribution: nearly normal, centered at 0, the null value, with the standard error we calculated earlier. Rejecting the null hypothesis requires a sample statistic that is sufficiently far from the null value that the combined area in the two tails is less than 5%. That is, the sample statistic needs to fall in the rejection region shown here. How do we decide where that rejection region falls? Under the normal model, 95% of observations fall within 1.96 standard deviations of the mean, and since we measure the variability of this distribution with the standard error, the rejection region starts 1.96 standard errors away from the mean; that's 1.96 times 1.70, or 3.332 millimeters of mercury, away from the null value. This could be on the positive side or the negative side of the null value, because we have a two-sided alternative.

Suppose the company's researchers care about finding any effect on blood pressure of 3 millimeters of mercury or larger versus the standard medication. What is the power of the test to detect this effect? In other words, 3 millimeters of mercury is the minimum effect size of interest, and we want to know how likely we are to detect an effect of this size in this study. If the treatment is indeed effective enough to result in an average drop in blood pressure of 3 millimeters of mercury, then the distribution of observed differences in average blood pressure between the two groups will be shifted away from the null by 3 millimeters of mercury, as shown in this plot here. We also know that, on this side of the null value, we can only reject the null hypothesis if the observed difference is less than negative 3.332 millimeters of mercury; the upper rejection region is so far from negative 3 that it contributes essentially nothing to the power. Putting all of this together, the probability of being able to reject the null hypothesis when the true effect size is negative 3 is equal to the green shaded area under this curve. We've simplified the task of calculating the power to calculating an area under a normal curve. We calculate a Z score as the cutoff, negative 3.332, minus the mean of the shifted distribution, negative 3, divided by the standard error we calculated earlier. This yields a Z score of negative 0.20, and the shaded green area is approximately 0.4207. Therefore, the power of the test is about 42% when the effect size is negative 3 and each group has a sample size of 100.
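Here is a minimal R sketch, not from the video, that reproduces this power calculation under the same assumptions: standard deviation 12 mmHg, 100 patients per group, 5% significance, and a minimum effect of interest of 3 mmHg.

```r
sd_bp  <- 12     # assumed standard deviation of blood pressures (mmHg)
n      <- 100    # patients per group
effect <- 3      # minimum effect size of interest (mmHg drop)

se     <- sqrt(sd_bp^2 / n + sd_bp^2 / n)   # standard error of the difference, ~1.70
cutoff <- -qnorm(0.975) * se                # lower edge of the rejection region, ~ -3.33
z      <- (cutoff - (-effect)) / se         # ~ -0.20
pnorm(z)                                    # power, ~ 0.42
```

Essentially the same answer comes from power.t.test(n = 100, delta = 3, sd = 12, sig.level = 0.05); it differs slightly because it uses the t distribution rather than the normal approximation used in the video.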
Obviously, this is much lower than the 80% power we set out to attain at the beginning of this video, which highlights how important it is not to just arbitrarily select a sample size and risk being left with an underpowered study. How can we fix things? We can work backwards from the desired power to determine the minimum required sample size instead. Note that the effect size is still negative 3, since that's what the drug company is interested in; however, the standard error will now be different, since it changes when the sample size changes. Let's sketch our distributions again and mark the power as the green shaded area. We're working backwards, so we first need to determine the Z score that marks the 80th percentile of the normal curve. The 80th percentile is marked by a Z score of 0.84, therefore the distance between the center of the green distribution and the cutoff for the rejection region is 0.84 times the standard error of this distribution, which is still unknown. We also know that the distance between the center of the null distribution and the start of the rejection region is 1.96 times the standard error, for a hypothesis test at the 5% significance level. Note that we're assuming the standard errors of the null distribution and the distribution of the observed data are the same, which would be true if the drug only lowers the blood pressure but doesn't change its variability. Then the effect size of 3 millimeters of mercury is spanned by 0.84 plus 1.96, that is, 2.8 standard errors. Now we have a simple problem to solve with one unknown. First, we calculate the standard error as 3 divided by 2.8, and we're purposely not going to round too much. Then we set this value equal to the square root of the sum of the variances of the two groups, 12 squared, each divided by the unknown sample size n. Solving for n yields 250.88, so we really need at least 251 observations in each group in order to detect an effect size of 3 millimeters of mercury.

When are these calculations actually used in practice? We can use them when designing a study, to calculate the required sample size for a desired level of power. Or we can calculate the power for a range of sample sizes and choose the target power level based on the resources available for collecting the required sample size. The plot here shows the power of the test we've been working with, calculated for sample sizes of 20 through 5,000 patients per group. Each data point on this curve is the power of the test for a given sample size, calculated just like we did earlier in the video, but obviously we didn't calculate all of these powers by hand: we coded them up in R and calculated them iteratively for sample sizes 20 through 5,000. We can see that as the sample size increases, so does the power, but only up to a point; there seems to be no good reason to recruit more than 500 patients or so for each group, since the power plateaus around that point. This is important to know when designing a study, in order to avoid wasting resources on a sample size that is larger than needed for the maximum power desired.
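Below is a sketch, again not the video's own code, of the working-backwards calculation and of how a power curve like the one described above could be generated. It uses the same rounded quantiles, 0.84 and 1.96, as the video, so the sample size matches the 250.88 quoted above.

```r
sd_bp  <- 12   # assumed standard deviation (mmHg)
effect <- 3    # minimum effect size of interest (mmHg)

# Work backwards: the effect must span 0.84 + 1.96 = 2.8 standard errors.
se_needed   <- effect / (0.84 + 1.96)          # ~ 1.07 mmHg
n_per_group <- 2 * sd_bp^2 / se_needed^2       # 250.88, so at least 251 per group
ceiling(n_per_group)

# Power for a range of sample sizes, computed the same way as earlier in the video.
n_grid <- 20:5000
se     <- sqrt(2 * sd_bp^2 / n_grid)
power  <- pnorm(effect / se - qnorm(0.975))
plot(n_grid, power, type = "l",
     xlab = "Sample size per group", ylab = "Power")
```

As a cross-check, base R's power.t.test(power = 0.80, delta = 3, sd = 12, sig.level = 0.05) solves for n directly; it returns a slightly larger value because it uses the t distribution and unrounded quantiles rather than the normal approximation above.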