Hello, everyone. Today we're going to start talking about classification. This is one of the core methods in the data mining pipeline. Our learning objectives are, of course, to be able to apply the techniques for classification, and also to be able to explain how they work. That's very important: it's not just about being able to use a method, you should understand how it works. We also want to learn how to evaluate with different measures and how to compare the methods, so that you can choose the right classification method for your specific problem. Now, let's start with a quick discussion of supervised versus unsupervised learning. You may have heard these terms in many different settings, and it is important to understand the difference between the two. With supervised learning, classification being a very good example, you have predefined classes: you know how many classes you're talking about, and which classes they are. You also have training data. The training data provides a list of objects, each described by its attribute values, but very importantly, also the corresponding class label. That is the ground-truth label you need, because otherwise you would not be able to know which objects go to which classes, and that's really our goal in this case. So with this training data, you know the sample objects and you know the corresponding class labels, and you try to construct a model, usually referred to as a classifier. This model can then be used on any new data to determine the class label for that new data. With unsupervised learning, on the other hand, which is the setting for clustering (we'll discuss that later), you don't have any predefined classes.
All you are provided with is a list of objects, described by multiple attributes. You have their attribute values, but there are no predefined classes or labels ahead of time. Instead, our goal in unsupervised learning is to detect whether there are potential groupings, like classes, in your dataset or not, and if there are, to identify those clusters. Because these two settings are quite different, how we work with the data and how we construct our model is very different. Now, when we talk about classification, there is a closely related term, which is prediction. Typically when we contrast classification with prediction, we're separating two cases. In classification we have classes, so there is usually a finite number of categorical labels. This could be yes or no, or class one, class two, class three, and our goal is to determine which of those few classes a particular object belongs to. Think about the application scenario of fraud detection: you have credit card transactions, or maybe life insurance claims, and you're trying to determine whether a given one is fraud or not, so you're typically talking about two classes, yes or no. Or disease diagnosis: you may have certain information about a patient, some lab work they have done, some symptoms, and you're trying to determine whether they belong to a particular class or not. Again, you're talking about a small number of classes, not many. And then object recognition: think about images.
Image object recognition is, of course, widely used: given an image, you want to be able to identify what objects are in it. You may have a few classes of objects, like cat and dog, or at a higher level, animals, cars, houses. You have those predefined classes, and you have training samples telling you which objects are representative of which classes. When we talk about prediction, in general what we're referring to is numerical values. So here, instead of a finite number of categorical classes, we're talking about predicting a specific value, usually a continuous value. Think about stock price prediction: you're predicting a particular value for the stock price. That value of course changes over time, and it's typically a continuous value within a certain range. Or look at traffic volume: you're predicting how many cars or vehicles pass a particular intersection. Again, you're talking about numerical values; the count here is a discrete value, but it still ranges over a potentially unbounded set of values. And then likes: think about online social media. You post a sunset photo and you're trying to determine how many likes you will get. That's basically a notion of popularity, or the impact of that particular post, so you're predicting the value of the number of likes. As you can see, these are two quite different settings, categorical values versus numerical values. But in both cases, you have some kind of training data.
That's why I grouped both of them under supervised learning: you have training data telling you, for some sample cases, what the corresponding class label is, or what the corresponding stock price or number of likes is. With that, let's take a general look at the classification process. Because it's supervised learning, you have training data, so you start with the first step, which is learning. In this step you're provided with some training data. The training data includes the objects with their particular attribute values, and also the corresponding class labels. With that you can construct your model: you're basically trying to establish some kind of relationship between the object attributes and the class label. Then you can do classification. This is the case where you're provided with test data. The test data, as far as the model is concerned, doesn't have class labels, and your goal is to determine the class label for each object in the test data. For the test data you actually still have the ground-truth labels, but those are for evaluation: you look at what your model says about the class label and compare it to the actual class label. That allows you to evaluate the performance of your model, and maybe improve it, or select between different models. Then, when you move to real-world deployment, that is the case where you see truly new data. The new data has the object attributes, but you don't have the class labels at all,
not even for evaluation, so you just use your model to make the decision about the class labels. Because of that, you really need to spend a lot of effort on steps one and two: you want to make sure you do a good job, so that when you deploy your model in the real world, it is likely to work well on data it hasn't seen before. There is also a lot of active research on model adaptation, because whatever you have learned in the lab setting may or may not match the real world, and in the real world things may change. So there's a lot of work on how you can further improve your model by adapting it to your actual deployment. One big part of any modeling strategy is evaluation, because you need to know how well your model works. You may have trained your model using the training set, evaluated it on a test set, and now you're expecting some kind of real-world deployment, so you really want to know how well your model works. But there are actually multiple criteria to consider when you're working on your classification model. The first one, naturally, is accuracy. For classification, because you have a limited number of class labels, you're basically checking whether the predicted class label your model determined matches the actual class in your ground-truth information. For prediction, since you're looking at numerical values, many times you don't just ask whether you have hit the exact value. That's really hard: think about a stock price, or the number of likes. You may not be able to predict exactly that numerical value, so instead you look at how close your prediction is.
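The two notions of "accuracy" just described can be made concrete with a small sketch: exact label matching for classification, and a closeness measure (here, mean absolute error) for numeric prediction. The values below are made up for illustration.

```python
# Classification accuracy: fraction of exact label matches.
predicted = ["yes", "no", "yes", "yes", "no"]
actual    = ["yes", "no", "no",  "yes", "no"]
accuracy = sum(p == a for p, a in zip(predicted, actual)) / len(actual)  # 4/5 = 0.8

# Prediction (numeric values): measure closeness instead of exact hits,
# e.g. mean absolute error over predicted like counts.
pred_likes   = [120, 45, 300]
actual_likes = [100, 50, 280]
mae = sum(abs(p - a) for p, a in zip(pred_likes, actual_likes)) / len(actual_likes)
# mae = (20 + 5 + 20) / 3 = 15.0
```

A model with a smaller MAE is "closer" even when it never hits the exact value, which is the comparison described above.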
You say: even though my prediction is not exactly right, it is very close, compared to another model that is further away. That is how you compare the accuracy, or preciseness, of the results. The next criterion is speed, because obviously you want a model that's efficient. When we talk about speed, consider two cases. There is the offline learning, or training, phase: when you're constructing your model, how long does it take? And then, once you have trained your model, for online use you're just doing the online decision making. Those two are separate. Many times you'll say you care more about the online decision-making part, because you need the classification result quickly. That's one part. But the offline model construction matters too: if it takes too long, that's probably not desirable, especially if you're trying to adapt your model in the real world. If your model can be trained or retrained quickly on your data, that's something you would prefer, and it may be very important for some real-world applications. The third criterion is interpretability. We will see that this one is very important, and there's quite a bit of variety across the different classification methods. Some methods make it reasonably easy to use the result and the process to explain why you arrived at a particular classification decision, while others are more like a black box: it performs well, so the results seem to be correct, but you have no idea how it's working or why it's making a particular decision. So depending on the interpretability of your method, it may or may not be applicable in certain scenarios.
For example, if you're trying to make an admission decision, a loan approval decision, or a medical diagnosis, and all you have is a black box, it's probably very difficult to use. So for any model you're creating for a classification problem, always think about whether you can explain the result and explain the decision. The fourth criterion is robustness. This generally refers to noisy data. We've talked about how in the real world there's rarely perfect data: the data can be noisy in many different ways, there may be errors, you may have missing information. The question is whether your model can still give robust decisions, robust classification results, when your data is not perfect. That, again, is very important for real-world application scenarios. The next one is scalability, because nowadays you are dealing with large data problems; the scale of the problems is usually big. So you can ask: if I have ten times more data points, is my model construction process, or my decision-making process, still efficient? Does it scale well with that increase? Another important angle is the notion of incremental data, which will come up in multiple places as we talk about modeling methods. The incremental question is this: you already have some data and you have constructed your model, but now you get more data. Do you need to redo the whole thing, that is, take the larger dataset and redo the training with all the data? Or can you take the model you have already constructed and use just the new data to update it? These two can be very different in terms of efficiency.
If you can work in an incremental fashion, then especially in stream processing cases, where data continues to come in, it's much more efficient to incorporate new data incrementally rather than having to process everything and redo your model construction from scratch. So these are some of the core criteria, and of course, depending on your application scenario, some of them may be more important and some less important. You're always talking about trade-offs, and there are potentially other criteria depending on your case. As you work on a classification problem, understand the problem setting and understand the criteria you're trying to satisfy. That will be very important when you're choosing which method to use, and when you're evaluating whether a particular method meets the requirements of your particular problem. All right, let's start with the first classification method, which is very widely used in many application scenarios and has been shown to provide very good results. It's fairly simple, and it's referred to as decision tree induction. We have all heard of decision trees in many settings, so let's think about a concrete example. Here I am looking at loan application approval. I'm receiving many applications requesting loans, and I am making decisions about whether to approve each application or not. So my decision, my class label, has two values: yes or no, approved or denied. Of course, to make the decision, I need to look at the application and the specific attributes in it, and use those to decide. Those will be the attributes: for each applicant, or each application, I need to know some information
about this particular person. You may have an ID number; that's simple, just the applicant ID. But then you would have the age information, whether this person is a student or not, the annual income information, a credit rating, or anything else you think may be relevant. So you ask for quite a bit of information in the application form, so that this information can help you make the decision. With decision tree induction, the model we end up with is a tree. The tree allows us to make a decision just by asking specific questions, checking the attribute values of a particular application. If you look at my tree here, one way to read it is to start from the top. Say I get a new application. I first ask: what is the age of the applicant? Then, depending on the value of that attribute, say the age is at most 30, I follow the corresponding branch. So depending on the attribute value, I take a particular branch, and once I get to the next level, I ask another question; basically, I pick another attribute to check. Say: is this applicant a student or not? Again you look at the attribute value; this could be a yes or no answer. Say the answer is yes, so you take the left branch, and at this stage you're ready to make your decision: this one is a yes. That basically shows the simple process.
Once you have this decision tree, all you need to do is follow it from top to bottom, branching based on the specific attribute value at each level, until you reach a leaf node. The leaf nodes are where you make your decision. That's what a decision tree looks like. You can see one benefit: it is fairly easy to interpret. Just by looking at the decision tree, you know exactly how you arrived at a particular decision, by following the branches. You can say: I'm approving this loan because of these attribute values, or this loan is declined because of this branch. That's the main idea of decision tree induction. Now the question, of course: if I am given this tree, it seems easy to use, but how do I construct it? I'm starting from a long list of applications with attribute values, the training data with decision labels; how do I turn that into this tree? When you look at the tree, there are two basic operations. One part is attribute selection. That means at any level, you need to ask a question, and the question is: which attribute do I ask about? Do I ask about the age? Do I ask about the person's credit rating? And in what order? That's the attribute selection part. Once you have selected an attribute, you then need to figure out the branching conditions. If the attribute is just yes/no, it's easy: you have two branches. But with multiple potential values, you need to figure out how you do the splitting, and it doesn't have to be binary.
You can have two branches, or multiple branches based on some grouping of the attribute values. In my case, for age, instead of one branch per age value, 30, 31, 32, 33, I can group them. The grouping follows a rough heuristic: you group the values so that objects with similar classes are more likely to follow the same branch. That's the basic step, and we'll discuss in more detail how we actually do it. Once you are able to do the attribute selection and the actual split at each level, the overall process of decision tree induction, constructing your model, goes as follows. First, it's a top-down approach: you start at the very beginning, at the top node, when you don't have any tree yet. At the first level you pick one attribute as your node, and based on the splitting you have your branches, and now you get to the subtrees. That's why it's recursive, and it's also divide and conquer: once you branch off into one subtree, you don't worry about the other ones; they are handled separately. You take your subtree at that point and continue recursively: I already know this person is within a certain age group; what's my next question? So it's a recursive, divide-and-conquer algorithm. But note that it is also a greedy algorithm. Greedy means it may not be globally optimal, but it gives reasonable performance and is very easy to use. That's why decision tree induction has been broadly used and has been shown to provide good, interpretable results.
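The top-to-bottom traversal just described can be sketched in a few lines. The tree structure, attribute names, value groupings, and leaf decisions below are illustrative placeholders, not the lecture's exact tree.

```python
# A decision tree as nested dicts: an inner node names the attribute to test,
# its branches map attribute values to subtrees, and a leaf is a plain label.
tree = {
    "attribute": "age",
    "branches": {
        "<=30":  {"attribute": "student",
                  "branches": {"yes": "approve", "no": "deny"}},
        "31-40": "approve",
        ">40":   {"attribute": "credit_rating",
                  "branches": {"fair": "approve", "excellent": "deny"}},
    },
}

def classify(node, record):
    """Follow branches by attribute value until reaching a leaf label."""
    while isinstance(node, dict):
        value = record[node["attribute"]]
        node = node["branches"][value]
    return node

# e.g. classify(tree, {"age": "<=30", "student": "yes"}) follows the
# left branch twice and returns "approve".
```

Note how each step reads off directly as an explanation ("approved because age <= 30 and student = yes"), which is the interpretability benefit mentioned above.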
Now let's look specifically at one method for doing decision tree induction. As I said, there are two key components: attribute selection and the attribute split. Let's look at the measure called information gain. What it does is look at the information in your dataset. The intuition is to look at the class distribution in your dataset or subset. If every object in a particular subset has the same class label, say all of them are approved cases, or all of them are declined cases, that's easy: you have a pure class distribution. A purer distribution means a smaller class entropy, which is what is being calculated here as information. So purer is better, because when you know all the objects in a subset share the same class label, that's when you can make a decision. If instead some of them are approved and some are not, you cannot make a decision yet; you need to go further and ask more questions. So we start with the original dataset D, and we know we're working with m different classes, with C_i denoting the i-th class label. I can quickly calculate the class distribution: p_i is the proportion of objects in the dataset belonging to class C_i. Once I have that, I compute the information, which is really the entropy computation: Info(D) is the negative of p_i times log base 2 of p_i, summed over all m classes. That's your initial information. Then, following the intuition, we try to split.
I try to pick an attribute and split the dataset D into subsets so that the subsets have purer classes; purer classes mean smaller entropy, smaller information. So I'm looking at the reduction. Say I have dataset D and I try my different attributes, say ten different attributes. I go through each of them and decide how to split. Here, with the information gain method, the splitting is one branch per value; that's for categorical attributes, where you can have those individual branches. In this case, attribute A has v distinct values, so we're talking about v branches. The information after using A to partition D is then calculated as a summation over the v branches: for each branch you take a weighted term, the size of that branch, that is how many objects fall within it, divided by the size of D, times the information of that subset D_j. So Info_A(D) is the sum over j of (|D_j| / |D|) times Info(D_j). That means once you pick an attribute and a particular branch, all the objects satisfying that attribute value go to that branch and form your subset. In my earlier example, all the applicants in the age group of at most 30 follow one branch, and that's your subset. Within the subset, you apply the same information calculation, and then you take the weighted sum. As we said, we're trying to reduce the information, so as we go through the different attributes and compute their resulting information, we pick the attribute with the largest information gain. We need the baseline here: that's the information of the original dataset.
Each attribute reduces the information by some amount, so you pick the one with the largest reduction; that reduction is the information gain we're looking at. The idea is that the attribute with the largest information gain is the most powerful or most discriminative in terms of separating the different classes, resulting in purer subsets. Let's look at a concrete example. Here is my table, my list of applicants. I have their age, income, student status, and credit rating information, and then I have my class label, which is the loan decision: whether each particular application is approved or not. I have 12 applicants in total, and two classes: the decision for approving the loan is yes or no. Looking at the table, 7 of my 12 cases are approved and 5 of them are declined. Using the information calculation, I can compute the base information, before any splitting or branching. I have this (7,5) split: 7 out of 12 are the yes cases, the first class, and 5 out of 12 are the declined class. Plugging that into the formula, the base information is 0.980. Now I need to go through the different attributes and calculate, for each attribute, how much it reduces that information. My example here is age. Age has three value ranges, so I have three branches, and for each branch, within that subset, you want to look at the class distribution.
For example, take all the applicants whose age is at most 30: how many of them go to that branch, and within that branch, how many of those cases were approved and how many were not? Then you do the same for 31 to 40, the middle branch, and for above 40, the right branch. It takes a little time, but you can work out the numbers: how many go to each branch, and within each branch, what the class distribution is. If you walk through the table, you will find that, out of the 12 applicants, 5 are at most 30 years old. Among those 5, looking at their loan decisions, you have a (2,3) split: there are 5 people in that branch, and among them, 2 are approved and 3 are not. You do the same for the middle branch, 31 to 40: 3 people follow that branch, and all 3 of them are approved, so that's 3 yes cases and 0 declined cases. For the right branch, you similarly get a (2,2) split: 4 people follow that branch, 2 of them approved, 2 not approved. With that, you can now compute the information for using age as the splitting attribute. I take dataset D and use age to divide it into the three subsets, and remember we're taking the weighted sum over the three subsets: the first has 5 objects with a (2,3) split, the middle one has 3 with a (3,0) split, and the right branch has 4 with a (2,2) split.
Using the formula, minus p_i times log base 2 of p_i summed across the classes, weighted by subset size, you get the information if you use age to do the branching: that's 0.738. With that we can compute the information gain: it's the difference from the initial information, the 0.980 we calculated on the previous slide, minus the new value. So the information gain for age is 0.242. That's for age. You can repeat the same process for the other attributes: income, student, credit rating. Once you have done all that, you just compare the information gains. In this particular example, if you compare across the different attributes, you will see that age is the one providing the largest information gain. That's why our earlier example shows age as the top node of the decision tree. Then you repeat the process: once you get to a sub-branch, you do the same calculation, but now using the subset instead of the original full dataset. All right. Besides the information gain method, there are other methods that try to address specific issues with this initial method. With information gain, because it's looking at the distribution across the branches, and the splitting is one branch per attribute value, think about the extreme case where I use the ID attribute. If I'm using ID as a potential splitting attribute, I have 12 cases and 12 unique values, so I would have 12 branches, each containing only a single case. A single case is pure, so very quickly you get all the individual IDs, each with its corresponding class label, and every branch has a pure distribution. But that's not useful, right?
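The calculation just walked through can be reproduced in a few lines of code. This sketch uses only the counts stated above: the (7,5) base split and the (2,3), (3,0), (2,2) age branches.

```python
from math import log2

def entropy(counts):
    """Information (entropy) of a class-count distribution: -sum p_i * log2(p_i)."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# Base information for the 12 applicants: 7 approved, 5 declined.
info_D = entropy([7, 5])                         # ~ 0.980

# Splitting on age gives three branches with these (yes, no) counts:
branches = [(2, 3), (3, 0), (2, 2)]              # <=30, 31-40, >40
n = 12
info_age = sum((sum(b) / n) * entropy(b) for b in branches)  # ~ 0.738

gain_age = info_D - info_age                     # ~ 0.242
```

The same `entropy` and weighted-sum pattern would be repeated for income, student, and credit rating, and the attribute with the largest gain wins.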
When you get a new applicant, they will have a new ID, so ID is clearly not helping you. The notion that addresses this is the split info. The gain ratio method adds a denominator, which is the split info. Split info looks at how the branching happens: it considers the sizes of the subsets you're creating, and it favors splits that generate fewer, larger subsets. Because of this denominator, if you have a large gain but you're splitting into too many branches, that actually reduces your gain ratio; whereas a large gain with fewer branches is the good case, and you would choose it over a scenario with too many branches. Another method, called CART, uses the Gini index idea. The key heuristic of this approach is that instead of one branch per attribute value, it always limits the split to two: it's always binary branching. You can have a categorical value or a numerical value, but the binary branching means that depending on what value you have, you go to either the left branch or the right branch. That, in a way, controls the number of subsets you generate, and because every attribute produces the same number of branches, the improvement in class purity is more directly comparable across attributes. As a result, the Gini index method generally gives you a nice binary, roughly balanced separation, which allows for more effective branching in many cases. All three methods, as you can see, have slightly different designs,
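To make the Gini index concrete, here is a small sketch of a binary split evaluation in the CART style. The particular split counts below are made up for illustration; only the overall (7,5) class distribution comes from the lecture's table.

```python
def gini(counts):
    """Gini index of a class-count distribution: 1 - sum p_i^2 (0 = pure)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Overall: 7 yes / 5 no among the 12 applicants.
gini_D = gini((7, 5))

# A hypothetical binary split: left branch gets (5 yes, 1 no),
# right branch gets (2 yes, 4 no). Weighted sum, as with information.
left, right = (5, 1), (2, 4)
n = 12
gini_split = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)

reduction = gini_D - gini_split   # the split with the largest reduction wins
```

Because every candidate split is binary, these reductions are directly comparable across attributes, which is the point made above.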
Even though the overall process is the same, it's more about what metric they're using and how they're branching on the attributes, okay? Next, let's look at another widely used classification method, referred to as Bayesian classification. So you may have heard of Bayes' theorem; in any intro probability course you should have learned it. Bayes' theorem basically says: I have X, which is whatever I'm observing, and I have a hypothesis. Given X, what is the probability that this particular hypothesis is true? It turns out that to calculate this, you can convert it into this form: you calculate the overall probability of that particular hypothesis happening, and then, given that the hypothesis is true, how often you would see X, something with these particular attribute values, and then you divide that by the probability of X occurring. So why is this useful, and how do we use it for classification? The key idea, of course, is that it gives us a probability. My classification problem is that I have an object with its attribute values; that's the observed X. And now I'm trying to determine which class this object should belong to. So the hypotheses can then be viewed as: X belongs to class 1, X belongs to class 2, up to X belongs to class m. You can use Bayes' theorem to compute the probability of your object X belonging to each of those m classes, and once you have computed the probabilities, you pick the highest one, because that's the most likely class label you should choose. So it's a statistical method with a solid statistical foundation, and it has been used in many cases with very good results. So in this setting, you have X, right?
You have X as a data sample with its attribute values. And then, how do you determine the class label? Once I set H to be the hypothesis that X belongs to a particular class, all I need is to compute this value: given X, what is the probability that the hypothesis is true? Using Bayes' theorem, this translates into calculating P(H), the prior probability, and P(X|H), the conditional probability, divided by P(X). So all you need is to calculate those individual pieces, and then you can calculate the probability for each class hypothesis, okay? To put it more specifically, this is the widely used method referred to as the naive Bayesian classifier. Just as we said, all you're doing is iterating through the m possible classes and calculating, given X, the probability that hypothesis C sub i is the case. Note the denominator here: that's P(X), the probability of X occurring with those particular attribute values. Since I'm comparing across multiple classes and just picking the largest one, and the denominator is the same for all of them, I can ignore the denominator and only look at the numerator to pick the highest value, because that gives me the class label I need to choose, okay? But then, why is it called naive? I said it's a Bayesian classifier because it's using Bayes' theorem, but there's a very important assumption here, which is referred to as the naive assumption. Remember, X is your object, but X is captured by multiple attributes; you have n different attributes. So, to calculate the probability of X...
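A count-based classifier along the lines just described might look like the following sketch. The tiny dataset is hypothetical (not the lecture's loan table), and note that the shared denominator P(X) is dropped exactly as discussed above, since it is the same for every class.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count-based training: priors P(C) and conditionals P(attr=v | C)."""
    priors = Counter(labels)
    cond = defaultdict(Counter)  # key (class, attr_index) -> value counts
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(y, i)][v] += 1
    return priors, cond

def classify_nb(x, priors, cond):
    """Argmax over classes of P(C) * prod_i P(x_i | C), ignoring P(X)."""
    total = sum(priors.values())
    best, best_p = None, -1.0
    for c, n_c in priors.items():
        p = n_c / total                       # prior P(C)
        for i, v in enumerate(x):
            p *= cond[(c, i)][v] / n_c        # naive assumption: multiply
        if p > best_p:
            best, best_p = c, p
    return best

# Hypothetical two-attribute dataset: (age group, income level) -> loan.
rows = [("<=30", "high"), ("<=30", "medium"), ("31..40", "high"), (">40", "low")]
labels = ["no", "no", "yes", "yes"]
priors, cond = train_nb(rows, labels)
print(classify_nb(("<=30", "high"), priors, cond))  # prints "no"
```

Notice that an unseen attribute value makes one factor zero and wipes out the whole product, which is exactly the zero-probability problem discussed next.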
You actually need to break it down and look at the probability of each of the individual attribute values occurring, okay? And we know that if they're independent, then it's actually easy: you can convert the joint probability over the multiple attributes into a product of the probabilities of the individual attribute values. But that is assuming those n attributes are independent. So if that assumption holds, then you can calculate this probability of X given C sub i by multiplying the probability of each of those attribute values occurring, given C sub i. Then the question, of course, is whether this is true, whether this naive assumption would apply in your real-world setting. Many times, yes. And actually, even if the attributes are not exactly independent, the assumption may still mostly hold, and the method has been shown to be usable in many real-world settings, okay? But there's one bit of detail that we need to pay attention to: zero probability. The zero-probability issue is that a particular attribute value may not occur in your dataset at all. For example, suppose you're not grouping the age attribute but looking at individual age values, and somehow nobody in your dataset has an age of, say, 80, or 10. Then when you're given a new object that happens to have that value that never showed up in your dataset, you would have a zero probability, because you haven't seen it before. And you know that if you are multiplying those individual probabilities, a single zero in any of the attributes gives you a zero combined probability, which of course is not what we want, okay.
So instead, if you see those kinds of zero-probability scenarios, the simple fix is to use the Laplacian correction. All you're doing is adding one to each of the cases. So for example, in your dataset, say 900 cases have value A, 100 have value B, but you have no cases with value C — so it's 900, 100, and zero. If you add one to each case, you'll have 901, 101, and 1, okay? Now when you calculate the probabilities, you don't have a zero probability, but the probabilities are still very close to your original ones. Okay, so that is important: whenever you're using your naive Bayesian classifier, just make sure you check for this zero-probability case and correct it if it does happen. Okay, so let's look at an example. Same table, right? This is the table we used earlier for the decision tree calculation. I have 12 applicants, and I have their age, income, student, and credit rating information; those are the attribute values. And then I'm trying to determine the class label for the loan approval: yes or no, okay? So I can start with my prior probabilities; that's basically the probability of each class occurring. I know that out of my 12 cases, seven of the applications were approved and five of them were not. So those are my prior probabilities for the two classes. Okay, now I have a particular applicant, X. X's age is at most 30, income is medium, student is yes, credit rating is fair. So with that, I need to compute the probability of X belonging to the yes class or the no class, and then pick the higher probability. All right. So first I need to compute the probability of each of those attribute values occurring, given a particular class label. So in this case, I'm looking at X under the yes class and under the no class, okay?
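The 900 / 100 / 0 example can be checked directly; this small helper just applies the add-one correction described above to a table of value counts.

```python
def smoothed_probs(counts):
    """Laplacian correction: add 1 to every value count so that no
    conditional probability comes out exactly zero."""
    adjusted = {v: c + 1 for v, c in counts.items()}
    total = sum(adjusted.values())
    return {v: c / total for v, c in adjusted.items()}

# Lecture example: 900 cases of value A, 100 of B, none of C.
probs = smoothed_probs({"A": 900, "B": 100, "C": 0})
# A: 901/1003 ~ 0.898, B: 101/1003 ~ 0.101, C: 1/1003 ~ 0.001
```

The corrected probabilities stay very close to the raw frequencies 0.9 and 0.1, but the unseen value C now contributes a tiny nonzero factor instead of zeroing out the product.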
And you just count how many times each value happens, right? For the yes class, I look at age, and you can already compute the counts corresponding to each class label: among the yes cases, two of them have age at most 30; among the no cases, three of them do. Similarly, you can do that for the income value medium: out of all my approved cases, three of them had medium income; out of all the not-approved cases, two of them had medium income, right? Then for student is yes and credit rating is fair, you do the same thing. So each time, what you do is look at the subset of loan decisions — say, all the approved ones — and count how many of them have the particular attribute value, for each of those attributes, okay? So once you have that, now we're ready to do the actual computation. There are some details, and you have to follow the steps a little bit, but at a high level, without looking at the actual values, you say: okay, I need to compute, if loan is yes, the probability that I will see that particular age value; and also, if loan is yes, the probability that I see that particular income value; and likewise for the student value and the credit rating. Right, so that's basically what you do: you take the numbers we calculated earlier and plug them in as the conditional probabilities — assuming it's this class label, what's the likelihood of seeing that particular attribute value? And then, using the naive assumption, you can just multiply those together. So that means P(loan is yes), right? That's the prior probability.
And then you have the conditional probabilities given loan is yes: the likelihood of seeing X given loan is yes is the product of those individual ones, and you multiply by this term, seven out of 12, because seven out of 12 cases are approved. Okay, so putting all this together, that is the probability — remember, this is the numerator — for the first class label, the loan-is-yes case, okay? And then you can repeat the same process, but this time using the numbers for the loan-equals-no cases. So you do the same thing: you calculate P(loan is no) times the likelihood of seeing X given loan is no, right? So the final step is comparing the yes case and the no case, because you have the two probability values. When you put them together, you say, okay, which one is higher? And that's the class label you're going to use, all right? Okay, so as we said earlier, the naive Bayesian classifier applies when you can make the assumption that your attributes are independent. That is usable in many cases, at least as a good approximation, but there are scenarios where there is actually pretty strong dependency between some of those attributes, so you cannot just use the naive classifier. Instead, we're going to use a way to explicitly capture those dependencies. Okay, so this is the general purpose of the Bayesian belief network approach. So what is it doing? It's basically trying to spell out the dependencies. Okay, so let's look at a simple example here: rain, the sprinkler, and the grass being wet. Those are the variables I'm considering, okay? So the first one says whether it's going to rain or not. That by itself has some probability, right?
Of rain being true or false. For example, in this case there's a 30% chance that it's going to rain and a 70% chance that it's not. Okay, and then the sprinkler system: whether the sprinkler system will be operating actually depends a bit on the rain. So you take the variable of rain or not, and then the probability of the sprinkler system running is conditioned on it. So you can see that if it's raining, there's only a very slight chance that my sprinkler system is still on; while if it's false, there's some other probability of my sprinkler system being on or not. So you have this conditional probability given whether it's going to rain or not. Okay, and further, if you look at the variable of whether your grass will be wet or not: apparently that depends on both. If it rains, then there is a higher chance that my grass will be wet; if the sprinkler system is on, then there's also a higher chance that my grass is wet. Okay, so what you're looking at is that because this one depends on both, you have your two parent variables — rain is true or false, sprinkler is true or false — and you basically look at the combinations of cases. And based on that, you can also build the conditional probability table for whether your grass is wet or not. Okay, so as you can see, you basically have this explicit representation of the variables, their dependencies, and also the conditional probability of each of those cases happening. And that is what the Bayesian belief networks capture. Okay, the key notion is this conditional dependency across those variables. Okay, so how do we do that? We're using a probabilistic graphical model, which is a DAG — a directed acyclic graph.
So it means you have directed edges but no cycles, because a cycle would mean you depend on others that ultimately depend on yourself, right? So that's not feasible. Instead, you take the variables, you start with the base ones that don't depend on others, and then you add in the other variables that depend on some of the existing variables, okay? And the directed links show who depends on whom. So once you have this graphical representation, then you fill in the conditional probability tables. Each table basically says: yes, this variable depends on those other variables, and depending on the values of those parent variables, this is the likelihood of this variable taking certain values. Okay, so when you have that — the whole graph — then you propagate through it. So once you have this belief network, again, you compute the probability of each of those scenarios, and for the variable corresponding to the class label you're trying to determine, you calculate the highest probability, and that is the corresponding label you're going to use. Okay. All right. So far we have given a general introduction to classification, and then we talked specifically about decision tree induction and also Bayesian classification, including both naive Bayes and the Bayesian belief networks. Okay, so next we're going to continue with some other methods and also how we're going to evaluate and compare across different classification methods.
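The rain/sprinkler/grass network can be evaluated by multiplying along the chain of conditional tables and summing out the variables you don't observe. Only P(rain) = 0.3 comes from the lecture; the other table entries below are assumed illustrative values for this classic example, not the lecture's actual numbers.

```python
# Assumed CPTs: only P(rain)=0.3 is from the lecture; the rest are
# illustrative values for the classic rain/sprinkler/grass network.
p_rain = {True: 0.3, False: 0.7}
p_sprinkler = {True: {True: 0.01, False: 0.99},   # given rain=True
               False: {True: 0.40, False: 0.60}}  # given rain=False
p_wet = {(True, True): 0.99, (True, False): 0.80,
         (False, True): 0.90, (False, False): 0.00}  # keys: (rain, sprinkler)

def p_grass_wet():
    """Marginal P(wet=True): sum the joint P(R)*P(S|R)*P(W|R,S) over R, S."""
    total = 0.0
    for r in (True, False):
        for s in (True, False):
            total += p_rain[r] * p_sprinkler[r][s] * p_wet[(r, s)]
    return total
```

The same sum-over-scenarios pattern, restricted to the scenarios consistent with what you observed, is how you would score each candidate class label and pick the most probable one.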