Okay. So, welcome back to our continued discussion of probability sampling. So far, we've been talking about simple random samples and some of their nice properties in terms of computing unbiased estimates based on random, representative samples of larger populations. Now, not every probability sample is a simple random sample. As I suggested earlier, simple random samples have important drawbacks; namely, they're very expensive for larger populations. It can be very expensive to draw even a small sample from a large population and then measure those individual units. This is where we get into what I call "complex" probability samples.
Simple random samples are rarely conducted in practice. The exception is when we have a relatively cheap data collection based on a well-defined population list, or maybe a sample of administrative records where we can literally pull records from a file cabinet or some kind of online filing system. With larger populations, as I mentioned, complex samples are often selected, again with each sampled unit having a known probability of selection. The difference is that with complex samples, we use very specific features of probability sample design that allow us to save on costs and make our samples more efficient. So, that's what we'll be talking about now.
In general, when we refer to complex samples, we mean anything that's more complicated than simple random sampling. When we use these design features, the samples deviate from the principles of simple random sampling that we've discussed. Okay. So, complex samples have certain key features, again relative to simple random sampling. Remember, with simple random sampling, we're taking a random sample of size little m from a larger population of size capital M, and we're just selecting units from a list of the population at random.
With complex samples, there are key design features that distinguish these types of samples from simple random samples. First of all, the population is divided into different strata, and part of the sample is allocated to each stratum. This ensures sample representation from each stratum and reduces the variance of survey estimates. This technique is known as stratification. You can think about it in terms of a simple random sample. Recall the stadium view in the previous lecture, where we had a nice representative sample from the whole stadium. But remember, with simple random sampling, every sample is equally likely, and when selecting cases at random, all 134 of the cases in that previous example could have come from one quadrant of the stadium. Those cases were still selected at random, but it just so happens that by random chance, they all come from one quadrant. In that case, we wouldn't have representation from all four quadrants of the stadium like we saw in that previous picture. That's what we call a bad sample. It's still a simple random sample, still a probability sample, but not ideal in terms of representation. Stratification allows us to ensure that we're allocating some of our sample to all these different divisions of the population.
Complex samples are also defined by cluster sampling. In cluster sampling, we might first select clusters of population units at random, for example, counties in the United States, all with known probability, so every county would have a known probability of selection within these different strata. So, for example, we might select a certain number of clusters from the Western United States, a certain number of clusters from the Midwestern United States, and so on. Sampling clusters first, before we actually randomly select people or households, saves a lot of money on the data collection.
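To make the stratification idea from the stadium example concrete, here is a minimal sketch in Python. The stadium layout (four quadrants of 5,000 seats each), the allocation of the 134 cases across quadrants, and the function name `stratified_sample` are all assumptions made for illustration; they are not taken from the lecture's actual example.

```python
import random
from collections import Counter

def stratified_sample(population, stratum_of, allocation, rng):
    """Draw a simple random sample of allocation[s] units from each stratum s."""
    by_stratum = {}
    for unit in population:
        by_stratum.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for stratum, n_h in allocation.items():
        sample.extend(rng.sample(by_stratum[stratum], n_h))
    return sample

rng = random.Random(2024)

# Hypothetical stadium: 4 quadrants, 5,000 seats each; a unit is (quadrant, seat).
quadrants = ["NE", "NW", "SE", "SW"]
population = [(q, seat) for q in quadrants for seat in range(5000)]

# Simple random sample of 134 seats: the quadrant counts are left to chance.
srs = rng.sample(population, 134)
srs_counts = Counter(q for q, _ in srs)

# Stratified sample: the 134 seats are allocated to quadrants in advance.
allocation = {"NE": 34, "NW": 34, "SE": 33, "SW": 33}
strat = stratified_sample(population, lambda unit: unit[0], allocation, rng)
strat_counts = Counter(q for q, _ in strat)

print("SRS counts by quadrant:       ", dict(srs_counts))
print("Stratified counts by quadrant:", dict(strat_counts))
```

The simple random sample can, by chance, land unevenly across quadrants, while the stratified sample matches the allocation exactly every time. That fixed allocation is what rules out the "bad sample" scenario.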
Again, with simple random sampling we would have to visit each household individually in a random, independent sample of households. Instead, we visit a larger cluster, like a US county, and then within that randomly sampled county, we go to a sample of households to collect our data. This again saves on the cost of data collection, okay? Then we randomly sample units within those clusters, like I mentioned, according to some probability of selection. So, once we get to a US county, we might track down a list of addresses within that county and select households at random from that list of all possible addresses. Again, the addresses on that list all have known probabilities of selection. So, we randomly select those households according to those probabilities of selection, just like we talked about with simple random sampling, and then measure those households and collect the measures that we're interested in. But the key distinction here is that those units come from a second stage of sampling within the initial randomly selected clusters. This is the idea of complex sampling.
So, in these complex samples, a unit's probability of selection is determined by several things. It's determined by the number of clusters that were sampled from a given stratum. It's determined by the total number of clusters in the population in each of those different strata. It's determined by the number of units that were ultimately sampled from within each of those randomly selected clusters, and it depends on the total number of units in the population in each of those different clusters. So, here's an example of how we might find a unit's probability of selection when using one of these more complex samples. Suppose that we were selecting a complex sample where we selected little a out of capital A clusters at random within a given stratum.
So, consider the Midwestern US as one major region of the United States, and we're defining that as a stratum. Within the Midwestern US, there might be capital A counties that we could sample from, and we select little a counties out of those capital A counties at random. Then, once we have that random sample of little a counties, within each of those counties, we select little b out of capital B units at random. In this case, we can do a little bit of math and write down the probability of selection of each of those units within that particular county. It starts with little a divided by capital A; that's the probability that a cluster was selected, assuming that we select a simple random sample of clusters. Then we multiply that first fraction by little b divided by capital B. That's the second-stage sampling fraction. So, within each of those little a clusters, we take little b out of capital B possible units from within that particular cluster. The people designing samples determine what little a and little b should be to satisfy constraints on the sample design and cost constraints. But we can still write down these probabilities of selection for each individual in the population based on these kinds of multi-stage designs, where we first select clusters and then select units within those clusters.
So, here's an example to make it even more concrete, using the National Health and Nutrition Examination Survey. Let's suppose that we divide the United States into different regions based on geography and population density. So, how dense is the population within a particular area, and where is that area geographically? We refer to these divisions of the larger US based on geography and population density as strata. Again, by allocating sample to each of these different strata, we ensure some representation from each of the strata.
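Going back to the two-stage selection probability described above, little a over capital A times little b over capital B, here is a minimal sketch that computes it and checks it against a small simulation. It assumes simple random sampling at both stages and, for simplicity, that every cluster contains the same number of units, capital B. The specific counts (5 counties, 4 households per county) and the function names are hypothetical.

```python
import random
from fractions import Fraction

def two_stage_probability(a, A, b, B):
    """P(unit selected) = (a / A) * (b / B), with SRS at both stages."""
    if not (0 < a <= A and 0 < b <= B):
        raise ValueError("need 0 < a <= A and 0 < b <= B")
    return Fraction(a, A) * Fraction(b, B)

def draw_two_stage_sample(A, a, B, b, rng):
    """Select a of A clusters, then b of B units within each selected cluster."""
    clusters = rng.sample(range(A), a)
    return {(c, u) for c in clusters for u in rng.sample(range(B), b)}

# Hypothetical stratum: 5 counties with 4 households each;
# sample 2 counties, then 2 households per sampled county.
A, a, B, b = 5, 2, 4, 2
p = two_stage_probability(a, A, b, B)  # 2/5 * 2/4 = 1/5

# Repeat the design many times and count how often household (0, 0) is selected.
rng = random.Random(1)
draws = 20000
hits = sum((0, 0) in draw_two_stage_sample(A, a, B, b, rng) for _ in range(draws))
observed = hits / draws

print(f"theoretical: {p} ({float(p):.3f}), observed share: {observed:.3f}")
```

The observed share of samples containing a given household should land close to the theoretical product of the two sampling fractions.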
We minimize the risk of a bad simple random sample, where, when selecting elements at random, maybe they all come from one region of the US. We want to minimize the risk of that happening. Next, we allocate some number of counties or groups of counties to be sampled from each of those different strata. Again, using the terminology that we just introduced, these are clusters. We randomly sample clusters, again, to save costs. That way, we can sample households within a very small geographic area, such as a county, rather than driving to households that have been randomly sampled from some larger geographic area. Then, we sample certain socio-demographic subgroups of individuals at higher rates within those counties. This is known as oversampling. So, maybe a given project has a certain target sample size for particular subgroups of individuals. We might sample these different subgroups at higher rates within those counties when we're randomly selecting households. What this leads to is different probabilities of selection for different types of individuals, depending on the goals of a given project. That oversampling means that different people will have different probabilities of being selected, that is, different rates of selection at that second stage, and that's okay. We still have a probability design, and we can use those probabilities to ultimately make representative statements about the larger population.
So, here's a picture of what this kind of multistage sampling process might look like, and this is borrowed from the National Health and Nutrition Examination Survey documentation. We start with the larger United States at stage one, and at stage one, again, we might divide the US into different regions and then sample these counties, these clusters, from within each of those regions. So, you can see the purple dots in this case. Those are the randomly selected counties at the first stage of random selection.
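To illustrate how oversampling produces different probabilities of selection, here is a small sketch. The household counts, the per-subgroup targets, and the first-stage probability of 1/20 for the county are all made-up numbers for illustration, not figures from NHANES.

```python
from fractions import Fraction

# Hypothetical county: number of households in each subgroup,
# and the target number of sampled households per subgroup.
households = {"subgroup_x": 900, "subgroup_y": 100}
targets = {"subgroup_x": 45, "subgroup_y": 45}

# Hypothetical first-stage probability that this county was sampled.
county_prob = Fraction(1, 20)

# Second-stage sampling rate within the county, by subgroup.
rates = {g: Fraction(targets[g], households[g]) for g in households}

# Overall probability of selection = county probability * within-county rate.
overall = {g: county_prob * rates[g] for g in households}

for g in households:
    print(f"{g}: second-stage rate {rates[g]}, overall probability {overall[g]}")
```

Here the smaller subgroup is sampled at nine times the rate of the larger one, so its members have nine times the probability of selection. Both probabilities are still known, which is what keeps this a probability design.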
Then, you see we bring one of those counties forward at stage two, and we look at smaller area segments within those randomly selected counties. We might then select these smaller area segments at random as well. Those are the little yellow selections within the larger county. So, this is multiple stages of cluster sampling, and again, this would be done to save on costs and to identify households within those randomly selected clusters or counties in a more cost-efficient manner. Rather than just driving to different households at random across an entire US county, we sample smaller area segments within that county at the second stage. Then, if we bring forward one of those randomly selected area segments within a county, we start to see the households within that area segment. We might list all those households, or purchase a commercial list of all those households, and then once we identify those households, select a random sample of them, maybe using simple random sampling within that particular smaller geographic area segment.
Then, once we have a randomly selected household, we have a field staff member who might go visit that household and identify all the individuals within that household. We bring a randomly sampled household forward. We can see maybe the father, the mother, and the kids within that household, and we might want to select one of those individuals at random. At all four of these stages, we know what the probabilities of selection are, and we maintain those probabilities of selection throughout the entire design. We always know what they are for each of the different units at every stage that you see here. That's the important feature of a probability sampling design. Ultimately, we can compute the probability of being included for every single individual that we might randomly sample.
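As a rough sketch of how the four stages combine, a person's overall probability of inclusion is the product of the selection probabilities at each stage: county, area segment, household, and person. The stage-level numbers below are invented for illustration, and the helper name `overall_probability` is hypothetical.

```python
import math
from fractions import Fraction

def overall_probability(stage_probabilities):
    """Multiply the conditional selection probabilities across all stages."""
    return math.prod(stage_probabilities, start=Fraction(1))

# Hypothetical selection probabilities at each of the four stages.
stages = {
    "county within stratum": Fraction(5, 50),
    "area segment within county": Fraction(4, 20),
    "household within area segment": Fraction(10, 40),
    "person within household": Fraction(1, 4),
}

p = overall_probability(stages.values())
for name, prob in stages.items():
    print(f"{name}: {prob}")
print("overall probability of inclusion:", p)  # 1/800
```

Because every stage-level probability is known by design, the product is known too, even though no single list of all individuals in the population was ever needed.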
So, I like this image because it's a good 3D view of everything that we're trying to do in terms of a multistage complex sample. So, what happens in NHANES, the National Health and Nutrition Examination Survey? They drive a huge semi-trailer that contains medical equipment and staff to each of these sampled counties and the smaller area segments within those counties. Then, once we have this complex multi-stage design, we need to collect data from the randomly sampled individuals. The randomly selected individuals within households are invited to come visit the semi-trailer, complete a survey interview with an interviewer, and then go through a medical exam with trained medical professionals, and that's where the data come from for NHANES.
In this type of design, the inverse of a person's probability of selection is what's referred to as their sampling weight. So remember, across all those stages, we can compute the probability of being included in the sample. If we just take the inverse of that probability of selection, then we have their sampling weight. So, for example, take that very complex multi-stage design: counties, then area segments within counties, then households within area segments, then individuals within households. If ultimately my probability of selection was one divided by 100, that means that my sampling weight is 100. In other words, I represent myself and 99 other people in the larger population. This sampling weight, which is a function of the probability of selection as you can see here, is used in the actual data analysis to compute representative population estimates.
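Here is a minimal sketch of how sampling weights might enter an estimate, assuming a weighted mean of the form sum(weight * value) / sum(weight). The four respondents, their probabilities of selection, and their BMI values are made up for illustration.

```python
from fractions import Fraction

# Hypothetical respondents: (overall probability of selection, measured BMI).
respondents = [
    (Fraction(1, 100), 24.0),
    (Fraction(1, 100), 27.0),
    (Fraction(1, 25), 31.0),  # member of an oversampled subgroup
    (Fraction(1, 25), 29.0),  # member of an oversampled subgroup
]

# Sampling weight = inverse of the probability of selection.
# A weight of 100 means this person represents themselves and 99 others.
weights = [float(1 / p) for p, _ in respondents]
values = [bmi for _, bmi in respondents]

weighted_mean = sum(w * y for w, y in zip(weights, values)) / sum(weights)
unweighted_mean = sum(values) / len(values)

print("weights:", weights)
print("weighted mean BMI:  ", round(weighted_mean, 2))
print("unweighted mean BMI:", round(unweighted_mean, 2))
```

The unweighted mean gives the oversampled respondents the same influence as everyone else, so it leans toward their values. The weighted mean scales each respondent by the number of people they represent.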
So, that probability of selection plays a direct role in the computation of estimates based on a complex sample. As I mentioned, once we collect data from the randomly selected individuals, those weights get used to compute unbiased estimates of population quantities of interest. These could be things like mean body mass index, or BMI. The weights account directly for the different probabilities of selection if certain groups are oversampled according to the sample design. So, these probabilities of selection, again, play a direct and essential role in the computation of unbiased population estimates.
So, why probability sampling? Again, just to stress a couple of these key concepts. Every individual in the population has a known, nonzero probability of selection, and subsequent random sampling according to these probabilities of selection ensures that all of the units in the population have some predefined chance of being sampled according to the probability sampling design. Once we know these probabilities of selection, we can literally write them down for each of the population units. We can use those probabilities of selection to compute unbiased estimates using the sampling weights that I just described, and we can also estimate features of the sampling distribution that we would ultimately see if we repeated that complex sampling over, and over, and over again. We can actually estimate what that sampling distribution would look like based on selecting only one sample, and this is part of the beauty of probability sampling. We don't have to draw samples over, and over, and over again to get a sense of what that distribution of estimates might look like. It's the work by Jerzy Neyman that we talked about earlier in the week that allows us to make population statements based on only one probability sample.
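The idea that a single probability sample tells us about the whole sampling distribution can be illustrated with a small simulation. This sketch assumes a simplified design, stratified simple random sampling with one oversampled group, rather than the full multi-stage NHANES design, and it uses the standard textbook variance estimator for a stratified mean. The population values are randomly generated and purely hypothetical.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical population: two groups of different sizes with BMI-like values.
strata = {
    "group_1": [rng.gauss(26.0, 4.0) for _ in range(9000)],
    "group_2": [rng.gauss(30.0, 5.0) for _ in range(1000)],
}
n_h = {"group_1": 100, "group_2": 100}  # group_2 is oversampled
N = sum(len(values) for values in strata.values())
true_mean = sum(sum(values) for values in strata.values()) / N

def draw_sample():
    """Simple random sample of n_h units within each group."""
    return {h: rng.sample(values, n_h[h]) for h, values in strata.items()}

def weighted_mean(sample):
    """Each sampled unit in group h carries the weight N_h / n_h."""
    total = sum(len(strata[h]) / n_h[h] * sum(ys) for h, ys in sample.items())
    return total / N

def estimated_se(sample):
    """Standard error of the weighted mean, estimated from one sample."""
    variance = 0.0
    for h, ys in sample.items():
        N_h = len(strata[h])
        share, fraction = N_h / N, n_h[h] / N_h
        variance += share**2 * (1 - fraction) * statistics.variance(ys) / n_h[h]
    return variance**0.5

# What we can do in practice: estimate the spread from ONE sample.
se_from_one_sample = estimated_se(draw_sample())

# What we usually cannot do in practice: repeat the sampling many times.
estimates = [weighted_mean(draw_sample()) for _ in range(2000)]
mean_of_estimates = statistics.fmean(estimates)
empirical_sd = statistics.stdev(estimates)

print(f"true mean {true_mean:.3f}, mean of estimates {mean_of_estimates:.3f}")
print(f"SE from one sample {se_from_one_sample:.3f}, SD over repeats {empirical_sd:.3f}")
```

Across repeated samples, the weighted estimates center on the true population mean, and the standard error computed from just one sample should be close to the spread actually observed over the repeats.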
So, probability sampling provides a statistical basis for making inferences about certain quantities in larger populations. This is a very important point about probability sampling. Next up, we're going to learn more about non-probability sampling and some difficulties with that approach, despite its popularity and the fact that it tends to be a lot cheaper than probability sampling in many cases. We're going to learn how to make population statements in that setting. It's not as easy as it is within the probability sampling setting. It's possible, but probability sampling makes it much easier for people analyzing survey data to make representative population statements.