[SOUND] We can compute this maximum likelihood estimate by using the EM algorithm. So in the E-step, we now have to introduce more hidden variables because we have more topics, so our hidden variable z, which is a topic indicator, can now take more than two values. Specifically, it will take k plus one values, with B denoting the background and 1 through k denoting the other k topics, right. So now the E-step, as you can recall, is to augment the data by predicting the values of the hidden variable. So we're going to predict, for each word, which of these k plus one distributions the word has come from. This equation allows us to predict the probability that the word w in document d is generated from topic theta sub j. And the bottom one is the predicted probability that this word has been generated from the background.

Note that we use document d here to index the word. Why? Because whether a word is from a particular topic actually depends on the document. Can you see why? Well, it's through the pi's. The pi's are tied to each document. Each document can have potentially different pi's, right. The pi's will then affect our prediction. So the pi's are here, and they depend on the document, and that might give a different guess for the same word in different documents, and that's desirable. In both cases we are using Bayes' rule, as I explained, basically assessing the likelihood of generating the word from each of these distributions and then normalizing.

What about the M-step? Well, you may recall that in the M-step we take advantage of the inferred z values to split the counts, and then collect the right counts to re-estimate the parameters. So in this case, we can re-estimate our coverage probability. This is re-estimated based on collecting all the words in the document, and that's why we have the count of the word in the document and a sum over all the words. And then we're going to look at to what extent this word belongs to the topic theta sub j. This part is our guess from the E-step; it tells us how likely this word is actually from theta sub j. When we multiply them together, we get the discounted count that's allocated to topic theta sub j. And when we normalize this over all the topics, we get the distribution over all the topics, which indicates the coverage.

Similarly, the bottom one is the re-estimated probability of a word for a topic. In this case we are using exactly the same count, you can see this is the same discounted count, it tells us to what extent we should allocate this word to this topic, but the normalization is different. Because in this case we are interested in the word distribution, we simply normalize this over all the words. This is different; in contrast, up here we normalize among all the topics. It would be useful to compare the two. They give us different distributions. And these tell us how to improve the parameters. And as I just explained, in both formulas we have a maximum likelihood estimate based on the allocated word counts [INAUDIBLE].

Now this is actually a general phenomenon in all EM algorithms. In the M-step, you generally compute the expected count of an event based on the E-step result, and then you normalize that count appropriately. So, in terms of computation, in this EM algorithm we can actually just keep counting various events and then normalize the counts. And when we think of it this way, we also have a more concise way of presenting the EM algorithm.
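To make the formulas just described concrete, here is one way to write them out. This is a sketch using lambda_B for the fixed weight of the background model, c(w,d) for the count of word w in document d, and V and C for the vocabulary and the collection; this notation is not fixed by the transcript itself, which only describes the formulas verbally.

```latex
% E-step: predict which distribution generated word w in document d
% p(z_{d,w} = j): the word came from topic j;  p(z_{d,w} = B): it came from the background
p(z_{d,w} = j) = \frac{\pi_{d,j}\, p(w \mid \theta_j)}{\sum_{j'=1}^{k} \pi_{d,j'}\, p(w \mid \theta_{j'})},
\qquad
p(z_{d,w} = B) = \frac{\lambda_B\, p(w \mid \theta_B)}
{\lambda_B\, p(w \mid \theta_B) + (1 - \lambda_B) \sum_{j=1}^{k} \pi_{d,j}\, p(w \mid \theta_j)}

% M-step: re-estimate the parameters from the allocated (discounted) counts;
% pi_{d,j} is normalized over the topics j, and p(w | theta_j) over the words w
\pi_{d,j} \propto \sum_{w \in V} c(w,d)\,\bigl(1 - p(z_{d,w} = B)\bigr)\, p(z_{d,w} = j),
\qquad
p(w \mid \theta_j) \propto \sum_{d \in C} c(w,d)\,\bigl(1 - p(z_{d,w} = B)\bigr)\, p(z_{d,w} = j)
```

Note that the two M-step updates start from the same discounted count; only the normalizer differs, over topics for pi and over words for p(w | theta_j), which is exactly the contrast made above.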
This concise presentation actually helps us better understand the formulas, so I'm going to go over it in some detail. As an algorithm, we first initialize all the unknown parameters randomly, all right. So in our case, we are interested in all of those coverage parameters, the pi's, and the word distributions, the thetas, and we just initialize them randomly. This is the initialization step, and then we will repeat until the likelihood converges. Now how do we know whether the likelihood converges? We can compute the likelihood at each step and compare the current likelihood with the previous likelihood. If it doesn't change much, we're going to say it has converged and stop, right.

So in each step we're going to do an E-step and an M-step. In the E-step we augment the data by predicting the hidden variables. In this case, the hidden variable z sub d,w indicates whether the word w in document d is from a topic or the background, and if it's from a topic, which topic. So if you look at the E-step formulas, essentially we're normalizing these probabilities of observing the word from each distribution. You can see, basically, the prediction that the word is from topic theta sub j is based on the probability of selecting theta sub j as the word distribution to generate the word, multiplied by the probability of observing the word from that distribution. And I say it's proportional to this because in the implementation of the EM algorithm you can keep a counter for this quantity and in the end just normalize it. The normalization here is over all the topics, and then you would get a probability.

Now, in the M-step, we do the same: we are going to collect these allocated counts for each topic, and we split the words among the topics. Then we're going to normalize them in different ways to obtain the re-estimates. So for example, we can normalize among all the topics to get the re-estimate of pi, the coverage. Or we can normalize over all the words, and that would give us a word distribution. So it's useful to think of the algorithm in this way because, when implementing it, you can just use variables to keep track of these quantities in each case, and then you normalize these variables to make them distributions. Now, I did not write out the normalizer for this one; I intentionally leave it as an exercise for you. Can you see what the normalizer for this one is? It's of a slightly different form, but it's essentially the same as the one you have seen here. So in general, in the implementation of EM algorithms, you will see that you accumulate various counts and then normalize them.
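This "count and normalize" view maps almost directly onto code. Below is a minimal sketch under that view, assuming NumPy, a document-term count matrix, a fixed background word distribution, and a fixed background weight; all names here (plsa_em, counts, lambda_b, and so on) are illustrative and not from the lecture.

```python
# Minimal PLSA-with-background EM sketch (illustrative, not the lecture's code).
import numpy as np

def plsa_em(counts, background, k, lambda_b=0.9, max_iters=100, tol=1e-4, seed=0):
    """counts: (D, V) array of word counts c(w,d); background: (V,) fixed p(w | theta_B)."""
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    # Initialize all the unknown parameters randomly, then normalize them into distributions.
    pi = rng.random((D, k))
    pi /= pi.sum(axis=1, keepdims=True)            # coverage pi_{d,j}
    theta = rng.random((k, V))
    theta /= theta.sum(axis=1, keepdims=True)      # word distributions p(w | theta_j)

    prev_ll = -np.inf
    for _ in range(max_iters):
        # E-step: predict the hidden variable z_{d,w} for every word in every document.
        mix = pi[:, :, None] * theta[None, :, :]             # (D, k, V): pi_{d,j} * p(w | theta_j)
        topic_total = mix.sum(axis=1)                        # (D, V): summed over the k topics
        p_z_topic = mix / (topic_total[:, None, :] + 1e-12)  # p(z_{d,w} = j), normalized over topics
        p_z_bg = lambda_b * background / (
            lambda_b * background + (1.0 - lambda_b) * topic_total + 1e-12
        )                                                    # p(z_{d,w} = B)

        # Repeat until the likelihood converges: compare with the previous likelihood.
        ll = np.sum(counts * np.log(lambda_b * background
                                    + (1.0 - lambda_b) * topic_total + 1e-12))
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll

        # M-step: split (discount) each word count using the E-step guesses ...
        alloc = counts[:, None, :] * (1.0 - p_z_bg)[:, None, :] * p_z_topic  # (D, k, V)
        pi = alloc.sum(axis=2)                     # ... normalize over topics -> coverage
        pi /= pi.sum(axis=1, keepdims=True) + 1e-12
        theta = alloc.sum(axis=0)                  # ... normalize over words -> word distributions
        theta /= theta.sum(axis=1, keepdims=True) + 1e-12
    return pi, theta
```

One common choice for the fixed background distribution is the normalized collection-wide word frequency, for example counts.sum(axis=0) / counts.sum(); lambda_b then controls how much of the common-word mass the background is allowed to absorb.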
So to summarize, we introduced the PLSA model, which is a mixture model with k unigram language models representing k topics. We also added a pre-determined background language model to help discover discriminative topics, because this background language model can help attract the common terms. And by computing the maximum likelihood estimate, we can discover topical knowledge from text data. In this case, PLSA allows us to discover two things: one is the k word distributions, each one representing a topic, and the other is the proportion of each topic in each document. And such a detailed characterization of the coverage of topics in documents can enable a lot of further analysis. For example, we can aggregate the documents in a particular time period to assess the coverage of a particular topic in that period. That would allow us to generate temporal trends of topics.

We can also aggregate the topics covered in documents associated with a particular author, and then we can characterize the topics written by this author, etc. In addition to this, we can also cluster terms and cluster documents. In fact, each topic can be regarded as a cluster, so we already have the term clusters: the higher-probability words can be regarded as belonging to the cluster represented by the topic. Similarly, documents can be clustered in the same way. We can assign a document to the topic cluster that's covered most in the document. So remember, the pi's indicate to what extent each topic is covered in the document; we can assign the document to the topical cluster that has the highest pi, as in the small sketch below. And in general there are many useful applications of this technique. [MUSIC]
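As a small illustration of that last point, document clustering from the estimated parameters is just an argmax over each document's coverage vector; the pi matrix and its values below are hypothetical, e.g. as could be returned by the plsa_em sketch above.

```python
import numpy as np

# Hypothetical coverage matrix pi (rows = documents, columns = topics),
# e.g. as returned by the plsa_em sketch above; the numbers are made up.
pi = np.array([[0.7, 0.2, 0.1],
               [0.1, 0.1, 0.8]])

doc_cluster = pi.argmax(axis=1)   # assign each document to its most-covered topic cluster
print(doc_cluster)                # [0 2]: document 0 -> topic 0, document 1 -> topic 2
```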