We just went through that notebook introducing Keras, and in it we saw that we needed to apply some transformations to our data to ensure that our neural nets performed well. Here we're going to talk about some other important transformations to keep in mind when training neural net models. Let's go over the learning goals for this section. We're going to cover pre-processing and preparing your data for analysis, so all the steps that come into play when you're building a neural network. Part of that will be, if you're doing multiclass classification, how to set things up so that you can predict across multiple classes rather than what we've seen so far, where there are just two classes. Then finally, we're going to discuss the importance of scaling the inputs to your neural net models. We saw this in our notebook when we used the StandardScaler to scale our data, and you can also use something like the MinMaxScaler, which we've seen in earlier courses.

For binary classification problems, where I'm just trying to decide between two classes, we have a final layer with a single node and a sigmoid activation. We saw that in our last notebook: we had a fully connected dense network, every unit fed into that final layer, there was only one node in that final layer, and it used a sigmoid activation function to produce the output. The sigmoid activation function has many desirable properties. It gives an output strictly between 0 and 1, and that value can be interpreted as a probability, so we can say which class is more likely and by how much. It has a nice derivative, which makes it easy to compute the gradient and to use it for back-propagation. And it's analogous to logistic regression: a bunch of inputs are combined linearly at that node, and then one nonlinear transformation, just as in logistic regression, outputs a value between 0 and 1.

The question is, can we extend this to a multi-class setting where we're trying to predict across multiple classes? If we want to do multi-class classification, we can use what we learned about one-hot encoding. We've used it most frequently for feature variables, and now we're going to apply that same concept to our outcome variable. One-hot encoding, again, works for categories: you take a vector with length equal to the number of categories, one position for each category. In this case the categories are the type of account you have: checking, savings, or mortgage. You then represent each category with a 1 at a particular position and 0 everywhere else. For example, with our bank account example, rather than coding the accounts as 1, 2, 3, we create three new columns, one for checking, one for savings, and one for mortgage. A checking account gets a 1 in the top position and zeros everywhere else, a savings account gets a 1 in the middle position, and a mortgage gets a 1 in the bottom position, with zeros everywhere else.
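As a concrete sketch of that idea, here is how you might one-hot encode an integer-coded account-type column using the Keras utility `to_categorical` (the labels below are made-up illustrative values, not the notebook's actual data; `pandas.get_dummies` or scikit-learn's `OneHotEncoder` would work just as well):

```python
import numpy as np
from tensorflow.keras.utils import to_categorical

# Integer-coded account types: 0 = checking, 1 = savings, 2 = mortgage
y = np.array([0, 1, 2, 1])

# One-hot encode: each label becomes a length-3 vector with a single 1
y_onehot = to_categorical(y, num_classes=3)
print(y_onehot)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```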
For multi-class classification problems, we're going to let that final layer be a vector with length equal to the number of possible classes, as we just saw on the last slide, and then we can extend the idea of the sigmoid to multi-class classification using the softmax function. The softmax function is just e raised to whatever the z output was for a particular class, divided by the sum of e^z across all of the classes combined. What that does is yield a vector with entries that are each between 0 and 1, normalizing them, and that sum to one, so we can read off a probability for each one of the individual classes. For the loss function, the one we pass in when we compile the model is categorical cross-entropy. This is just the log-loss function in disguise: the cross-entropy is the negative sum over classes of y_i, the actual value, times the log of y-hat_i, the prediction. This loss has a nice property when used with the softmax: the derivative of the loss with respect to that last z_i is simply y-hat_i, the prediction, minus y_i.

Next, let's discuss actually scaling our inputs. In our discussion of back-propagation, we briefly touched on the formula for the gradient used to update the values of our weights, W. I promise this will tie back into scaling our inputs, so just hold tight. Note that to update our weights, we take the partial derivative with respect to W, and we get y-hat minus y, which is that first partial derivative, dotted with a, whatever the input was from the previous layer. At each iteration of gradient descent, W_new is going to be W_old minus a learning rate times this partial derivative. When i equals 0, that is, at the first layer, the activations are just the input values X themselves, so those raw inputs show up directly in the derivative that updates W. This means that if we don't normalize the input values, the weights attached to features with higher values are going to update much more quickly than those attached to features with lower values, because again, we're using the a_i from the prior layer to update our weights. If the features are on different scales, some weights update quickly and others lag behind, throwing off the way we update our model. This imbalance can greatly slow down the speed at which our model converges.

For that reason, we need to scale our inputs, and there are different ways to do that which we've discussed in prior courses. One is linear scaling to the interval between 0 and 1, which is MinMax scaling: X_i minus X_min, over X_max minus X_min, ensures every value falls between 0 and 1. Another is linear scaling to the interval between negative 1 and 1, which is just 2 times (X_i minus X_min), over X_max minus X_min, minus 1, and that ensures you have values between negative 1 and 1. Again, we could also use the StandardScaler. Sometimes we want these values between 0 and 1 or between negative 1 and 1 because, if you think about using the sigmoid function or the hyperbolic tangent function, that keeps each one of our inputs and outputs on the same scale. The sketches below show the softmax and cross-entropy formulas worked out numerically, and then how the scaling and the multi-class output fit together in Keras.
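To make those formulas concrete, here's a small NumPy sketch (the vectors are made-up illustrative values) computing the softmax of a final-layer output z and the categorical cross-entropy against a one-hot target:

```python
import numpy as np

# Raw final-layer outputs (logits) for one example across 3 classes
z = np.array([2.0, 1.0, 0.1])

# Softmax: e^(z_i) / sum_j e^(z_j) -- entries in (0, 1) that sum to 1
y_hat = np.exp(z) / np.sum(np.exp(z))
print(y_hat, y_hat.sum())           # [0.659 0.242 0.099] 1.0

# One-hot true label (class 0) and categorical cross-entropy:
# loss = -sum_i y_i * log(y_hat_i)
y_true = np.array([1.0, 0.0, 0.0])
loss = -np.sum(y_true * np.log(y_hat))
print(loss)                         # ~0.417
```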
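And here is a hedged sketch of how the scaled inputs, the softmax output layer, and the categorical cross-entropy loss might come together in Keras. The feature matrix X, labels y, and layer sizes are placeholders for illustration, not the notebook's actual data or architecture:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler   # or MinMaxScaler(feature_range=(-1, 1))
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical

# Placeholder data: 100 examples, 4 features, 3 classes
X = np.random.rand(100, 4)
y = np.random.randint(0, 3, size=100)

# Scale inputs so every feature's weights update at a comparable rate
X_scaled = StandardScaler().fit_transform(X)
y_onehot = to_categorical(y, num_classes=3)

# Final layer: one node per class with a softmax activation
model = Sequential([
    Dense(16, activation='relu', input_shape=(4,)),
    Dense(3, activation='softmax'),
])

# Categorical cross-entropy is the multi-class log-loss discussed above
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_scaled, y_onehot, epochs=10, verbose=0)
```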
Let's recap what we learned here in this section. We discussed pre-processing and preparing our data for our neural net models. With that, we introduced how we can do multi-class classification with neural networks using one-hot encoding as well as the softmax function. Then we discussed the importance of scaling your neural network inputs to ensure that you have balanced updates for each one of your weights. We talked about how we can use different scalers, such as the MinMaxScaler or the StandardScaler, to ensure that each one of your features is on the same scale. That closes out our discussion of the transformations that are important for your neural net models. In the next section, we're going to introduce our first specialized model framework for neural networks, namely convolutional neural networks. All right. I'll see you there.