Okay, so now let's talk about probability sampling in a little bit more detail and see some examples of different types of probability samples. So we'll go into a little bit more detail about this technique and talk about why probability samples, again, are important for making representative statements about populations. So here's a brief overview of this lecture. We're going to start with simple random sampling as one type of probability sampling. We're going to abbreviate that by SRS, and we're going to make links between simple random sampling and the notion of i.i.d. data that we talked about in previous lectures for this course, so independent and identically distributed data. And we'll go over a little fictional example about email response times for a customer service organization. Then we'll get into what are called complex samples for larger populations. These are samples that have very specific design features, namely stratification, cluster sampling, and weighting. These are still probability samples, they're just more complex than simple random samples. And as an example, we'll talk about a major national health survey called the National Health and Nutrition Examination Survey, or NHANES. And again, we'll keep talking about key benefits of probability sampling as we move forward. So let's start with simple random sampling, or SRS. With SRS, we start with a known list, or a sampling frame (recall from the previous lecture), of N population units. So N refers to the size of the population. Again, these could be individuals, households, businesses, or establishments, but we have N units in the population, and we have them all listed. And we randomly select n units from the list, so n is the size of our sample. Every single unit, following this sample design, has an equal probability of selection, and that probability of selection for every single unit on the list is defined by n / N. So that's our first example of a sampling fraction, okay?
This is the fraction of the population that ultimately gets selected to be in the sample. And in the case of simple random sampling, that probability of being included in the sample is the same for everybody in the population: it's n / N. What this also means is that all possible samples of size n are equally likely. Okay, so every potential sample of size n that we could draw, they're all equally likely to be selected when using simple random sampling. Furthermore, estimates of means, proportions, totals, and other statistics of interest based on the data that we collect from a simple random sample are what we refer to as unbiased. So we talked a little bit about bias in the last lecture. With simple random sampling as a type of probability sampling, when we compute estimates based on the simple random sample, those estimates are unbiased. And what that means is that these estimates are equal to the population values that we're interested in on average. Remember that idea of a sampling distribution that we introduced. There could be variability in these estimates depending on what samples we select, but on average, across all those hypothetical samples, these estimates are going to be equal to the population values of interest. That's what we mean by unbiased. So here's just a picture of what a simple random sample of, say, 134 people would look like if those 134 people were randomly selected from a list of size 10,000. Okay, and this is a stadium view of that random sampling. And you can see all the black dots in this stadium view. Those represent the randomly selected individuals. So you can see that that random sample of size 134 is representative of all different quadrants of that stadium. We have some sample from all those different quadrants. So we have a representative selection from all the different areas of that particular stadium. And again, in that little example, those 134 people were selected entirely at random from a larger list of 10,000 people.
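As a small illustration of how that stadium sample might be drawn, here's a minimal Python sketch using the standard library's random module. The seed and the unit IDs are assumptions made purely so the sketch is reproducible; the key points are the sampling fraction n / N and the without-replacement draw.

```python
import random

N = 10_000  # population size: everyone in the stadium
n = 134     # sample size

# Under SRS, every unit has the same probability of selection: n / N.
selection_probability = n / N  # the sampling fraction, 0.0134

# Number the units 1..N and draw a simple random sample (without replacement).
random.seed(42)  # seed only so this sketch is reproducible
population = range(1, N + 1)
sample = random.sample(population, n)

print(selection_probability)  # 0.0134
print(len(sample))            # 134
print(len(set(sample)))       # 134 -- no unit appears twice
```

Because `random.sample` draws without replacement, each of the 10,000 units ends up with exactly a 134 / 10,000 chance of appearing in the sample.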
So we can make good representative statements about the larger population in that stadium by collecting data on those 134 people that were randomly selected using this technique. Okay, so simple random sampling can be with replacement or without replacement. With replacement means that when we select somebody from a larger list, we replace them in that list and give them a chance of being selected again in the sample. More common is that simple random sampling is done without replacement. So once an individual unit is sampled from a given list, they can't be sampled again. But for both of these types of simple random sampling, it turns out that the probability of selection for each unit is still that same quantity, n / N. So whether we're doing with-replacement selection from a list, where people could be selected multiple times, or without-replacement selection, the probability of selection still remains the same. Everybody has a little n divided by a capital N chance of being included in the sample. Now, simple random sampling, while it might seem easy to think about, is rarely used in practice. Why is that? Collecting data from n randomly sampled units in a very large population can be prohibitively expensive. For example, consider a simple random sample of 1,000 people from throughout the United States, and suppose that we had to drive to each of those individual households to interview those randomly selected individuals. It would become incredibly, prohibitively expensive to have someone drive to each of these randomly selected households. So we think about alternative types of probability samples that are much less expensive to field in practice, and this is especially true for large populations. Simple random sampling is generally done when populations are smaller, and it's easier and less expensive to ultimately collect the data.
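The with-replacement versus without-replacement distinction can be made concrete with a short Python sketch. The tiny population, sample size, and seed here are all made up for illustration:

```python
import random

random.seed(7)  # seed only so this sketch is reproducible

N = 10                            # small illustrative population
population = list(range(1, N + 1))
n = 4                             # sample size

# Without replacement: each unit can be selected at most once.
srs_without = random.sample(population, n)

# With replacement: the same unit can be drawn more than once.
srs_with = random.choices(population, k=n)

# Without replacement, each unit's chance of inclusion is exactly n / N.
# With replacement, each unit has a 1/N chance on each of the n draws,
# so it is selected n / N times on average.
print(srs_without)  # four distinct IDs
print(srs_with)     # four IDs, possibly with repeats
```

The `random.sample` call implements the without-replacement design, and `random.choices` the with-replacement design; either way, n / N governs how often any given unit shows up across repeated samples.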
Okay, what about that connection to i.i.d. data, a concept that we talked about a little bit earlier in this course? Recall that i.i.d. observations are independent and identically distributed, thinking about the distributions that they come from. Simple random sampling will generate i.i.d. data for a given variable, in theory. So when we select a simple random sample and then collect measurements on all the units in that sample, in theory that will produce data that are i.i.d. in terms of the definition that was introduced earlier. All randomly sampled units are going to yield observations that are independent: there's no connection between the units that are randomly sampled. They're selected independently of each other, so they're not correlated with each other in terms of their measures of interest. And furthermore, those units are identically distributed. Okay, so they're representative of some larger population of values, again, in theory. So all those measures that we're trying to collect from a simple random sample arise from an identical distribution that describes the distribution of values in the larger overall population. So we have a representative, randomly selected set of units with observations that are independent from each other, so i.i.d. data, in terms of the sampling process. So here's an example of simple random sampling, just to make things a little bit more concrete. Suppose that we have a customer service database of N = 2,500 email requests that came in in 2018, where N, recall, is the size of the population, and the director of the customer service division wants to estimate the average, or mean, email response time. So how long did it take that customer service division to respond to each of these 2,500 email requests? Now you might look at this and say, well, that seems pretty easy. We could just go into all 2,500 of those emails and see how long it took.
Well, unfortunately in this case, the exact calculations require a manual review of each email thread: going in and tracking down very specific time stamps for when each email was responded to. So it's not straightforward to just measure every single one of the 2,500 email requests. So the director asks the analytics team to sample, process, and then analyze a small sample of n = 100 emails instead of trying to measure every single email. There are a couple of different approaches we could take to get this kind of sample. A naive approach would be to simply process the first 100 emails on the list. We have a list of 2,500 emails, so let's just take the first 100. In this case, the estimated mean response time could be biased if, for example, the customer service representatives learn, or get better over time, at responding more quickly. So maybe at the beginning of these email responses, it took a lot longer to respond while people were still getting the process down. Our estimated mean might then be biased to be too large. The first 100 observations could also come from a small group of staff, so the first set of emails being responded to could've come from very specific individuals. In other words, the email responses were not fully representative. They weren't independent of each other, because maybe they were being answered by the same staff members, and they weren't identically distributed, because they arose from a different distribution of response times than the distribution for everybody in the entire department. So there's no random selection according to specific probabilities when following this approach. What we essentially have with this naive approach is a non-probability sample, and that comes with important limitations. A better approach, using simple random sampling, would be to number those emails from 1 to 2,500, and then randomly select 100 of those numbers, the IDs for each of the emails, using a random number generator.
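The numbering-and-random-selection approach just described can be sketched in a few lines of Python. The seed is an assumption, there only to make the sketch reproducible:

```python
import random

N = 2_500  # population: all 2018 email requests, numbered 1 to 2,500
n = 100    # sample size requested by the director

random.seed(2018)  # seed only so this sketch is reproducible
email_ids = range(1, N + 1)

# Use a random number generator to select 100 of the 2,500 email IDs
# without replacement -- a simple random sample.
sampled_ids = random.sample(email_ids, n)

# Every email has the same known probability of selection: n / N.
selection_probability = n / N
print(selection_probability)  # 0.04
print(len(sampled_ids))       # 100
```

The analytics team would then pull only those 100 email threads for manual review, rather than all 2,500.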
In this case, every email would have a known probability of selection: the sample size, 100, divided by the size of the population, 2,500. So following this simple random sampling approach, every email would have this known probability of being selected into the sample. Furthermore, this would produce a random, representative sample of 100 emails, again in theory: a random selection from all possible emails that were responded to. The estimated mean response time in this case will also be an unbiased estimate of the population mean, like we talked about with simple random sampling. On average, if we were to draw many, many samples of size 100 and compute the mean of each of those, the average of those means would be equal to the true population value. This is one of the very nice features of probability sampling.
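That unbiasedness property can be checked by simulation. The sketch below invents a population of 2,500 response times (the exponential shape and 24-hour average are purely assumptions for illustration), draws many simple random samples of size 100, and shows that the average of the sample means lands right at the true population mean:

```python
import random
import statistics

random.seed(1)  # seed only so this sketch is reproducible

# Hypothetical population of 2,500 email response times, in hours.
# The exponential distribution here is made up purely for illustration.
population = [random.expovariate(1 / 24) for _ in range(2_500)]
true_mean = statistics.mean(population)

# Draw many simple random samples of size 100 and record each sample mean.
sample_means = [
    statistics.mean(random.sample(population, 100)) for _ in range(2_000)
]

# Individual sample means vary from sample to sample (the sampling
# distribution), but their average across repeated samples is very close
# to the true population mean -- that's unbiasedness.
average_of_means = statistics.mean(sample_means)
print(round(true_mean, 2), round(average_of_means, 2))
```

Any one sample mean may be a bit high or a bit low, but there is no systematic tendency in either direction, which is exactly what the lecture means by an unbiased estimate.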