So this is indeed the general idea of the Expectation-Maximization, or EM, algorithm. In all EM algorithms we introduce a hidden variable to help us solve the problem more easily. In our case the hidden variable is a binary variable for each occurrence of a word, and this binary variable indicates whether the word has been generated from theta sub d or from theta sub B. Here we show some possible values of these variables. For example, for "the", which is from the background, the z value is one; "text", on the other hand, is from the topic, so its z value is zero, and so on. Now, of course, we don't observe these z values; we just imagine there are such values of z attached to all the words, and that's why we call them hidden variables. The idea we talked about before, of predicting which word distribution has been used to generate a word, is really to predict the value of this hidden variable.

So the EM algorithm would work as follows. First, we initialize all the parameters with random values. In our case, the parameters are mainly the probabilities of each word given theta sub d, so this is the initialization stage. These initialized values allow us to use Bayes' rule to take a guess at the z values, so we guess these values. We can't say for sure whether "text" is from the background or not, but we can have our guess, and it is given by this formula; this is called the E-step. So the algorithm first uses the E-step to guess these z values. After that, it invokes another step, called the M-step. In this step we simply take advantage of the inferred z values and group together the words that are believed to have come from the same distribution, like these from the background, including this one as well. We can then normalize the counts to estimate the probabilities, or to revise our estimate of the parameters. To illustrate, we can group the words that are believed to have come from theta sub d, such as "text", "mining", "algorithm", and "clustering", and these will help us re-estimate the parameters that we're interested in. Note that before, we just set these parameter values randomly, but with this guess we have a somewhat improved estimate. Of course, we don't know exactly whether each z is zero or one, so we're not going to do the split in a hard way; rather, we do a soft split. This is what happens here: we adjust each count by the probability with which we believe the word has been generated using theta sub d. And where does this probability come from? Well, it comes from here, from the E-step.

So the EM algorithm iteratively improves our initial estimate of the parameters by using the E-step first and then the M-step. The E-step augments the data with additional information, like z, and the M-step takes advantage of that additional information to split the data counts, and then collect the right counts to re-estimate our parameters. Once we have a new generation of parameters, we repeat this: we do the E-step again to improve our estimate of the hidden variables, and that leads to yet another generation of re-estimated parameters for the word distribution that we are interested in.
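To make the two steps concrete, here is a minimal sketch in Python of the procedure just described, assuming a fixed, known background distribution and equal, fixed probabilities of selecting each model; the function and variable names are illustrative choices, not from the lecture.

```python
def em_topic_background(word_counts, p_background, p_bg=0.5, n_iters=50):
    """word_counts:  dict word -> count of the word in the document
    p_background: dict word -> p(w | theta_B), a fixed, known background model
    p_bg:         p(theta_B), the probability of picking the background model
    Returns the estimated topic distribution p(w | theta_d)."""
    words = list(word_counts)

    # Initialization: start p(w | theta_d) from some values
    # (uniform here; the lecture uses random values).
    p_topic = {w: 1.0 / len(words) for w in words}

    for _ in range(n_iters):
        # E-step: Bayes' rule gives a guess of the hidden variable z for
        # each word, i.e. p(z = topic | w).
        p_z_topic = {}
        for w in words:
            from_topic = (1.0 - p_bg) * p_topic[w]
            from_bg = p_bg * p_background[w]
            p_z_topic[w] = from_topic / (from_topic + from_bg)

        # M-step: soft-split each word count by p(z = topic | w), then
        # normalize the allocated counts to re-estimate p(w | theta_d).
        allocated = {w: word_counts[w] * p_z_topic[w] for w in words}
        total = sum(allocated.values())
        p_topic = {w: allocated[w] / total for w in words}

    return p_topic
```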
Okay, so, as I said, the bridge between the two steps is really the hidden variable z, which indicates how likely each word is to be from the topic word distribution, theta sub d. This slide has a lot of content, and you may need to pause the video to digest it, but it basically captures the essence of the EM algorithm: start with initial values, which are often randomly set, then invoke the E-step followed by the M-step to get an improved setting of the parameters, and then repeat. So this is a hill-climbing algorithm that gradually improves the estimate of the parameters, and as I will explain later, there is a guarantee of reaching a local maximum of the log-likelihood function.

So let's take a look at the computation for a specific case. These are the EM formulas that you have seen before, and you can also see the superscripts, like n here, which indicate the generation of parameters; here, for example, we have n plus one, meaning we have an improvement from this generation to the next. In this setting we have assumed that the two models have equal probabilities of being selected and that the background model is known. So what are the relevant statistics? Well, these are the word counts. Assume we have just four words, with counts like this, and this is our background model, which assigns high probabilities to common words like "the". In the first iteration, you can picture what will happen. First we initialize all the values, so the distribution that we're interested in is set to a uniform distribution over all the words. Then the E-step gives us a guess of which distribution has been used to generate each word, and we can see that we get different probabilities for different words. Why? Because these words have different probabilities under the background model. So even though the two distributions are equally likely, and our initialization is a uniform distribution, the differences in the background distribution give us different guessed probabilities: these words are believed to be more likely from the topic, while these, on the other hand, are less likely to be from the topic and are probably from the background.

Once we have these z values, in the M-step these probabilities are used to adjust the counts. So four is multiplied by this 0.33 in order to get the count allocated toward the topic, and this is done by this multiplication. Note that if our guess says this is 100 percent, that is, if this probability is 1.0, then we get the full count of this word for the topic; in general it's not going to be 1.0, so we only get some percentage of the count allocated toward the topic. Then we simply normalize these counts to get a new generation of parameter estimates. You can compare this with the older one, which is here, and you will see that the probabilities are different. Not only that, we also see that some words that are believed to have come from the topic, like "text", now have a higher probability. Of course, this new generation of parameters allows us to further adjust the inferred latent, or hidden, variable values, so the E-step gives us a new generation of z values based on the new generation of parameters, and these new inferred values of z in turn give us another generation of estimated word probabilities, and so on and so forth. So this is what would actually happen when we compute these probabilities using the EM algorithm.
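To mirror this walkthrough, here is a self-contained sketch of one first iteration on made-up data; the four words, their counts, the background probabilities, and all variable names below are hypothetical, not the values on the slide.

```python
# One explicit EM iteration on hypothetical data.
p_bg = 0.5                                       # equal model probabilities
word_counts = {"the": 4, "paper": 2, "text": 4, "mining": 2}
p_background = {"the": 0.5, "paper": 0.3, "text": 0.1, "mining": 0.1}

# Initialization: uniform topic distribution over the four words.
p_topic = {w: 0.25 for w in word_counts}

# E-step: guess p(z = topic | w) for each word with Bayes' rule.
p_z_topic = {}
for w in word_counts:
    from_topic = (1 - p_bg) * p_topic[w]
    from_bg = p_bg * p_background[w]
    p_z_topic[w] = from_topic / (from_topic + from_bg)
    print(f"p(z=topic | {w}) = {p_z_topic[w]:.2f}")

# M-step: allocate each count to the topic by multiplying it with
# p(z = topic | w), then normalize to get the new p(w | theta_d).
allocated = {w: c * p_z_topic[w] for w, c in word_counts.items()}
total = sum(allocated.values())
p_topic = {w: allocated[w] / total for w in word_counts}
print(p_topic)
```

Running this prints the guessed p(z = topic | w) values and the re-estimated topic distribution: a word the background already explains well, like "the", gets a small p(z = topic | w), so only a small fraction of its count is allocated to the topic.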
As you can see in the last row, where we show the log-likelihood, the likelihood is increasing as we do the iterations. Note that the log-likelihood is negative because each probability is between 0 and 1, so when you take the logarithm you get a negative value. What's also interesting is the last column: these are the inferred word splits, that is, the probabilities that each word is believed to have come from one distribution, in this case the topic distribution. You might wonder whether this is also useful, because our main goal is to estimate these word distributions; that is our primary goal, and we hope to obtain a more discriminative word distribution. But the last column, which is a by-product, can actually be very useful as well. You can think about how we might use it, for example, to estimate to what extent this document has covered background words: when we add these probabilities up, or take their average, we roughly know to what extent the document consists of background words versus content words that are not explained well by the background.
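As a rough illustration of these two quantities, here is a small sketch under the same assumed two-component setup: the log-likelihood of the document under the mixture, and a count-weighted average of p(z = background | w) as one possible way to gauge background coverage. The function names and the averaging choice are mine, not from the lecture.

```python
import math

def log_likelihood(word_counts, p_topic, p_background, p_bg=0.5):
    # Sum over words of count(w) * log of the mixture probability of w.
    # Each mixture probability is below 1, so every term is negative and
    # the total log-likelihood is negative, as in the last row of the slide.
    return sum(c * math.log(p_bg * p_background[w]
                            + (1 - p_bg) * p_topic[w])
               for w, c in word_counts.items())

def background_coverage(word_counts, p_topic, p_background, p_bg=0.5):
    # Count-weighted average of p(z = background | w): a rough measure of
    # how much of the document the background model accounts for.
    total = sum(word_counts.values())
    cov = 0.0
    for w, c in word_counts.items():
        from_bg = p_bg * p_background[w]
        from_topic = (1 - p_bg) * p_topic[w]
        cov += c * from_bg / (from_bg + from_topic)
    return cov / total
```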