Okay. So, here we are with Part Two on non-probability sampling. Now we're going to talk about some of the approaches we might use to make inferences about populations when we do in fact have data from a non-probability sample. There are a couple of potential population inference approaches we can use when dealing with data from a non-probability sample.

The first approach is called the pseudo-randomization approach. In this approach, we combine the data that we have from a non-probability sample with data from a probability sample. The key requirement is that both samples collect similar measurements, that is, similar variables measured on all the units from which we're collecting data. So we start by finding those common variables in the two datasets, one from the non-probability sample and one from the probability sample, and then we stack the two datasets together; we literally append them to each other, as you see in the picture. We then estimate the probability of being included in the non-probability sample as a function of auxiliary information that's available in both samples. For example, we might have socio-demographic variables like gender, race/ethnicity, age, and education, or other socio-economic data, available on the respondents in both samples. We use all of that information to determine, for each individual in the non-probability sample, their probability of being included in that sample when considering both datasets simultaneously. We then treat those estimated probabilities of selection into the non-probability sample as if they were known, and we analyze the combined data using the probability sampling methods we discussed previously.

Here's how this works. We create an indicator in the stacked dataset: if a case comes from the non-probability sample it gets a one, and if it comes from the probability sample it gets a zero. Then we use logistic regression to estimate the probability of being in the non-probability sample as a function of all those other variables, like age, gender, education, and race/ethnicity. Once we have those probabilities, we can ask: for an individual who showed up in the non-probability sample, given all of their features, their age, their gender, their education, and so forth, what would their probability of being in this stacked dataset have been? We then pretend that this estimated probability of selection is known, and we use the methods for probability samples on these two data sources combined.
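To make that concrete, here is a minimal sketch of the pseudo-randomization idea in Python, assuming pandas and statsmodels are available. The file names, the auxiliary variables (age, gender, education), and the simple inverse-propensity pseudo-weight at the end are hypothetical illustrations of the general idea, not the specific procedure used in any particular study.

```python
# Minimal sketch of the pseudo-randomization idea (hypothetical file and column names).
import pandas as pd
import statsmodels.api as sm

# np_sample: non-probability sample; p_sample: probability (reference) sample.
# Both are assumed to contain the same auxiliary variables: age, gender, education.
np_sample = pd.read_csv("nonprob_sample.csv")
p_sample = pd.read_csv("prob_sample.csv")

# Flag sample membership and stack the two datasets.
np_sample["in_np"] = 1
p_sample["in_np"] = 0
stacked = pd.concat([np_sample, p_sample], ignore_index=True)

# Logistic regression: probability of being in the non-probability sample
# as a function of the shared auxiliary variables.
X = pd.get_dummies(stacked[["age", "gender", "education"]], drop_first=True)
X = sm.add_constant(X).astype(float)
model = sm.Logit(stacked["in_np"], X).fit()

# Treat the fitted propensities for the non-probability cases as known
# probabilities of selection; one simple convention uses their inverses
# as pseudo-weights for the non-probability cases.
stacked["p_hat"] = model.predict(X)
np_cases = stacked[stacked["in_np"] == 1].copy()
np_cases["pseudo_weight"] = 1.0 / np_cases["p_hat"]
```

Once each non-probability case has a pseudo-weight, estimation can proceed with the weighted methods for probability samples discussed earlier; in practice, these pseudo-weights are often adjusted further, for example by incorporating the probability sample's own survey weights.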
A second approach we can use is called the calibration approach. In this approach, we compute weights for all of the responding units in our non-probability sample that allow the weighted sample, once those weights are applied, to mirror a known population. For example, suppose that in your opt-in web survey or your convenience sample, 70 percent of the participants are female and 30 percent are male; in the picture here, seven of the ten respondents in the non-probability sample are female. But you know that in your target population, remember we talked about the notion of carefully defining a target population, 50 percent are female and 50 percent are male.

So, what we do is develop weights for the non-probability sample; these are the weights that would actually be used in the analysis when computing estimates. We will ultimately down-weight the females, because there are too many females in our non-probability sample, and up-weight the males, so that after applying the weights the non-probability sample looks more like the known population, with a 50/50 distribution. This is especially important if the characteristic showing the imbalance is correlated with the variables we're interested in, the ones we ultimately want to make population statements about. If our sample looks more like the population on a characteristic that is strongly correlated with our variables of interest, we get closer and closer to making unbiased statements about the population. The limitation, therefore, is that if the weighting factor, the characteristic used to develop the weights and make the non-probability sample look more like the population, is not actually related to the variable of interest, the weighting will not reduce the sampling bias that may have come from the non-probability sampling. So it's important to pick characteristics for developing the weights that are closely related to the variables you're actually interested in.
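Here is a minimal sketch of that weighting idea for the gender example above, a simple post-stratification-style calibration on a single variable. The numbers mirror the 70/30 sample versus the 50/50 population; the column names and data are made up for illustration.

```python
# Minimal sketch of calibration (post-stratification) on a single variable, gender.
import pandas as pd

# Hypothetical non-probability sample: 7 of the 10 respondents are female.
sample = pd.DataFrame({"gender": ["F"] * 7 + ["M"] * 3})

# Known population distribution: 50% female, 50% male.
population_share = {"F": 0.50, "M": 0.50}

# Weight = population share / sample share for each respondent's group,
# so females are down-weighted and males are up-weighted.
sample_share = sample["gender"].value_counts(normalize=True)
sample["weight"] = sample["gender"].map(lambda g: population_share[g] / sample_share[g])

print(sample.groupby("gender")["weight"].first())
# F: 0.50 / 0.70 ≈ 0.714  (down-weighted)
# M: 0.50 / 0.30 ≈ 1.667  (up-weighted)
```

With these weights, the weighted gender distribution of the sample matches the 50/50 population (7 × 0.714 ≈ 5 and 3 × 1.667 ≈ 5); whether this actually reduces bias in an estimate depends on how strongly gender is related to the variable of interest.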
So, here's an example. Let's return to the idea of analyzing data collected from tweets to illustrate this notion of a non-probability sample. Suppose we used an API to extract information from several hundred thousand tweets. Again, big data: lots of information, lots of tweets included in our sample. For each of these tweets, a coding team goes through and determines an indicator of whether the tweet expresses support for President Trump. This is no easy task; it might involve computer scientists or other coders who can process the information in these hundreds of thousands of tweets and compute a binary indicator, yes or no, of whether a particular tweet indicated support for President Trump.

In this scenario, the probability of a tweet being selected cannot be determined. We just grabbed all the tweets that included some mention of President Trump. In that sense, this is a convenience sample of tweets, and we can't determine the probability of any one tweet being selected by this mechanism. We don't have a frame of all possible tweets, the tweets don't have probabilities of selection, and we don't randomly select them; we just take whatever is available that mentioned President Trump. Furthermore, the Twitter users are not a random sample of some larger population. People who join Twitter are generally interested in expressing their ideas or opinions about particular topics, and people choose to join Twitter; again, there is no random selection and no probability of joining Twitter. Twitter users are not a purely random, representative sample of some larger population, so we have to keep that in mind as well.

So, again, we get lots of data, several hundred thousand tweets, and we might have a very nice algorithm for computing a binary indicator of support for President Trump. But given these features, the lack of probabilities of selection and the absence of random sampling, there is high potential for sampling bias: a non-representative sample of individuals and a general lack of representation. The people who are typing tweets about President Trump may represent a very unique set of individuals, and we may only be capturing the people who have very strong opinions about Trump one way or the other. We may be missing the people who don't have opinions strong enough to lead them to join Twitter or tweet about their support for President Trump. So we have high potential for bias and a lack of representation in that non-probability sample.

Okay. So, what's next? What are we going to talk about next? Sampling distributions and sampling variance. We've hinted at these ideas so far in introducing probability and non-probability samples and talking about where data come from, and we've just reviewed a couple of possible approaches to making inferences based on non-probability samples. Next, we're going to talk in more detail about the notions of a sampling distribution and sampling variance, and how we can estimate the features of these distributions based on only one probability sample; this is a wonderful feature of probability sampling. We're also going to look at some examples of making population inferences based on the type of sample that was selected, whether a probability sample or a non-probability sample, like we just discussed, where we either estimate the probability of being in the non-probability sample after stacking a probability sample and a non-probability sample together and then pretend those estimated probabilities are known, or we use some kind of weighting approach, a calibration approach. It's very important to understand where these types of data come from and what type of sampling mechanism was used to ultimately produce the dataset we're working with.

Finally, we're going to introduce model-based approaches to analyzing data. This is another purely statistical approach that is more or less agnostic about the type of sampling mechanism that was used. As we get further and further into this specialization, we'll talk more about regression models, but we'll introduce some of these model-based approaches to analyzing data whether they come from a probability sample or from a non-probability sample. We briefly introduced one example of a model-based approach in this second part on non-probability sampling: remember the idea of using a logistic regression model to estimate the probabilities of being included in a non-probability sample when we stack the non-probability and probability samples together. That's an example of a model-based or model-assisted approach to making inferences about the population, and we're going to look at more examples of doing that as we move forward.