So now that we've talked about where to do your survey and what stratification is and how to stratify, I want to talk about what a cluster survey is, what clusters are, and how to select your clusters. So first I want to take a step back and go back to this question of what a sampling frame is. We've mentioned sampling frames previously. So again, a sampling frame is just a comprehensive listing of all those within a population who can be sampled. So as we'll talk about, there are different stages of sampling when you're sampling for a household survey. At the first stage you're sampling clusters, and so there your sampling frame is just a list of all the clusters in the population. Then within clusters you sample households, and there your sampling frame is a list of households. If you're sampling individuals, your sampling frame is a list of individuals, again including all those who are eligible to be sampled. Having a sampling frame is a basic requirement for probability sampling because that is what allows you to calculate the probability of selection. Without it, you don't know what the probability of selection is, you can't calculate weights, and you will likely have a biased sample. What is cluster sampling? So we use cluster sampling when the survey population is too large to develop a sampling frame of all the households or individuals in the population, right? So going back to my earlier example, if we're in Sumy region, we would not want to develop a sampling frame of all of the households in Sumy region. Instead we would divide Sumy region into clusters and we would develop a sampling frame of those clusters. You will also sometimes hear clusters referred to as primary sampling units. That means basically they are the first things that you sample in multistage sampling. 
And so when you're doing cluster sampling, the process is essentially this: you list the clusters in your survey population, or in each stratum if you're stratifying, in order to obtain a sampling frame of clusters within each stratum. You then sample clusters from the sampling frame. I'm going to go into more detail on each of these steps later. In each sampled cluster, you list the households, which are sometimes called secondary sampling units, to obtain a sampling frame of households. And then in each cluster, you sample households from that household sampling frame to obtain your sample of households. So now let's look at cluster sampling graphically. So you start by stratifying, as we've talked about. So here we're showing the sampling for stratum 2, but you would do this in each stratum. So in stratum 2, we have listed out the clusters, there are seven of them. And then we have sampled four of those clusters, those are the ones that are highlighted in blue. And then in each of those sampled clusters, we develop a sampling frame of households. And from that list of all the households in the cluster, we sample 30 households. So we now have a sample of 30 households in each of four clusters, so 120 households total. Interviewers then visit each of those households, and in each household they determine which individuals are eligible to be interviewed. Eligibility, we'll talk about later on, is usually based on age and sex. And again, you want to repeat this process for every stratum that you have in your survey population. So let's talk in a little bit more detail about what a cluster is. A cluster is simply a unit that contains multiple sampling elements. So what do I mean by sampling elements? A sampling element is the thing that you are sampling. 
So in the case of a household survey, we are sampling households and so the cluster is just a group of households. The important thing is that each household in the survey area must be included in one and only one cluster. If a household is not included in any cluster, then it doesn't have the possibility of being selected. So you're violating one of the requirements of probability sampling. If a household is in more than one cluster, then it has the possibility of being double selected, which is also a problem. For both logistical and statistical reasons, clusters should ideally be approximately similar in size so that your weights don't get too crazy, and cluster boundaries should be clearly defined. In low- and middle-income countries, we strongly recommend using census enumeration areas as clusters. So what's a census enumeration area? Most countries, almost all countries, do a decennial census, so a census approximately every ten years. And as part of that census, they essentially divide the country into a large number of what they call enumeration areas, or EAs, which usually have somewhere between 100 and 250 households each. And so these are small enough that it is feasible to develop a list of all the households in the enumeration area. The National Institute of Statistics in each country maintains a list of all of the EAs in the country from the previous population census. They also should have maps of each EA and its estimated population size from the most recent census. If you are planning to use EAs as your clusters, you will want to contact the National Institute of Statistics well in advance of your survey in order to arrange to use these lists and to obtain maps of the sampled EAs. This takes time. Often there's a fee associated with it, and so you want to allow sufficient time to do this. So the question often comes up whether I can use villages as clusters. Is that possible? People do use villages a lot for household surveys. 
We don't really recommend it for a couple of reasons. One of the most important reasons is that villages vary widely in size, right? You can have a village that has 100 households and you could have a village that has 5000 households. It's really not feasible to enumerate all households in large villages. Also, big differences in village size, and therefore cluster size, will lead to a wide range of sampling weights and higher variance, which ultimately will give you less precision in your estimates. Also importantly, the boundaries of villages are often not precisely defined, right? So there may be some disagreement about which households are included in a village or not. And you may end up with a situation where you have a household that's not clearly included in any village, or potentially a household that might be listed in multiple villages. And so this can create problems with your sampling. What is the process for sampling clusters? Again, within each stratum, basically, you want to develop a list of the clusters. If you're using enumeration areas, once you get that list from the National Institute of Statistics, this is fairly easy, right? You have your list and then you sample clusters in each stratum using systematic random sampling with or without probability proportional to population size, okay? So let's talk a little bit more about what systematic random sampling is. So simple random sampling is the type of sampling that people are probably most familiar with, right? And this is literally a random selection. So you take your enumeration areas and you randomly select three of them. So the disadvantage of this is that there's a chance, a small chance but a chance, that you could sample all of your clusters in a particular geographic area, right? So maybe all of the clusters that you sample are in some little corner of the region or they all end up in a more urban area just by chance. 
With systematic random sampling, before we do the sampling, we order the list of clusters by geographical area. So we might order by district and then within district by whatever the subdistrict is, the commune or the department and so forth. And then we use an interval for sampling. So we start at a random point in the list and then we calculate a sampling interval. And essentially what this does is, it ensures that you have a geographically diverse sample. So there's no chance that your sample of clusters would be all clustered together in one area. To get into more detail about how to do this, again, you order your list of clusters by geographical area. So for example, if I'm working in Burkina Faso, I can sort my list of clusters first by region, then by province, then by department, then commune, then village, right? Those are the administrative levels. I then generate a random starting point and I calculate a sampling interval. So if you're sampling clusters, the sampling interval is the total number of clusters in the stratum divided by the number of clusters that you want to sample. The sampling interval is calculated differently if you're doing sampling proportional to population size, which we'll talk about more later. You then start at your random starting point and you select every kth cluster, where k is the sampling interval, until you reach the total number of clusters that you need. And Excel is really helpful. You can automate this in Excel and we'll show you how to do that next. All right, so let's now talk a little bit about sampling clusters with probability proportional to population size, or PPS sampling, which you may hear people talking about. So in PPS sampling, the probability of selection of each cluster is proportional to its estimated population size, either the number of households or the number of individuals in that cluster. So what does this mean? 
Basically larger clusters, clusters with a larger population, have a higher probability of being sampled than smaller clusters, right? The advantage of PPS sampling is that it produces what we call a self-weighted sample in each stratum. So if you are sampling clusters using PPS and then you are sampling the same number of households in each cluster, the probability of selection of every household in the stratum is the same. I'm not going to get into the formula to show you why that's the case, but trust me that it is. If you sample clusters without PPS, the probability that a household will be selected, and therefore its weight, is going to vary widely between clusters, because clusters of different sizes have the same probability of selection. And so what that essentially means is that households living in smaller clusters have a higher probability of being selected than households living in larger clusters. So what do I need to conduct PPS sampling? I need estimates of the population size in each cluster. It does not need to be the actual population size, which is difficult to get unless you've been to the cluster very recently. But you need to have a reasonably good idea of how many households or individuals live in each cluster. If you are using census enumeration areas as your clusters, you can get this information from the National Statistics Office. So why do we use PPS sampling? You don't have to use it, but it can be advantageous because first of all, it's analytically simpler, the weights are simpler. All households within a stratum have more or less the same weight, and then also you have fewer different weights. So without getting into too much technical detail, this will essentially result in less variance and more precision in your estimates. So your confidence intervals will be a little bit narrower than if you're not using PPS sampling and you have a wide range of weights.
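
To make the steps above a bit more concrete, here is a minimal Python sketch of systematic random sampling, with and without PPS, for a single stratum. This is only an illustration and not the Excel approach mentioned in the lecture. The cluster names and household counts are made up, the function names are hypothetical, and the stratum of seven clusters with four sampled and 30 households per cluster simply mirrors the earlier graphical example. The last part checks numerically the point about self-weighting: with PPS and a fixed number of households per cluster, the cluster size cancels out of the household's selection probability.

```python
import random

def systematic_sample(clusters, n, rng):
    """Equal-probability systematic sample of n clusters from an ordered list."""
    k = len(clusters) / n        # sampling interval: clusters in stratum / clusters wanted
    start = rng.random() * k     # random starting point within the first interval
    last = len(clusters) - 1
    # Step through the list in jumps of k; min() guards against float rounding at the end
    return [clusters[min(int(start + i * k), last)] for i in range(n)]

def pps_systematic_sample(clusters, sizes, n, rng):
    """Systematic sample of n clusters with probability proportional to size."""
    k = sum(sizes) / n           # the interval is now counted in households, not clusters
    start = rng.random() * k
    targets = [start + i * k for i in range(n)]
    selected, cumulative, t = [], 0, 0
    for cluster, size in zip(clusters, sizes):
        cumulative += size
        # A cluster is picked once for every target that falls inside its size range
        while t < n and targets[t] < cumulative:
            selected.append(cluster)
            t += 1
    return selected

def household_probability(size, total, n_clusters, n_households):
    """Overall selection probability of a household under PPS cluster sampling."""
    p_cluster = n_clusters * size / total   # stage 1: proportional to cluster size
    p_household = n_households / size       # stage 2: same number of households per cluster
    return p_cluster * p_household          # size cancels: n_clusters * n_households / total

# Hypothetical stratum: 7 clusters already ordered geographically, made-up household counts
clusters = ["EA1", "EA2", "EA3", "EA4", "EA5", "EA6", "EA7"]
sizes = [120, 200, 150, 250, 100, 180, 210]

rng = random.Random(2024)        # fixed seed so the draw can be reproduced
print(systematic_sample(clusters, 4, rng))
print(pps_systematic_sample(clusters, sizes, 4, rng))

# With PPS and 30 households per cluster, every household has the same probability
pps_probs = [household_probability(s, sum(sizes), 4, 30) for s in sizes]
# Without PPS, each cluster has probability 4/7, so households in small clusters are favoured
equal_probs = [(4 / len(sizes)) * (30 / s) for s in sizes]
print(pps_probs)
print(equal_probs)
```

One caveat with this sketch: in the PPS version, a cluster whose size is larger than the sampling interval can be selected more than once. That does not happen with the made-up numbers here, but real frames with very large clusters would need a rule for handling it.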