In order to understand the intuition behind how neural networks work and how we train them, let's start by understanding how an individual artificial neuron works. There are different types of artificial neurons, but let's start with the first and most basic, which is called the perceptron. The perceptron is a simple model where we take a set of inputs x multiplied by a set of weights or coefficients w. We sum the results together and pass them through what's called a threshold function, where we compare the output of that sum, z, to 0. If z is higher than 0, we output a 1. If z is lower than 0, we output a -1. Therefore the perceptron is a model that would be used for a binary classification type of task. As you look at this model, you might recognize much of this from our discussion on linear models. And, in fact, the perceptron really is a very simple linear model: it starts with a linear combination of our input features x times our coefficients or weights, sums those together, and compares the result against the threshold to generate an output prediction. Another type of artificial neuron is logistic regression, which we covered in an earlier module. Logistic regression is, in fact, very similar to the perceptron, but we now add one more component to our model, which is the activation function. In the case of logistic regression, we use a sigmoid function as our activation function. So in logistic regression, we start with our input x, we multiply each of our features in x by a weight or coefficient, and we sum them together to get our z. We then pass our z through the activation function, the sigmoid function for logistic regression. And as an output of that, we get the probability that y is equal to 1, or the probability that that data point belongs to the positive class. We then pass this probability that y is equal to 1 through our threshold function.
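To make the two neurons concrete, here is a minimal NumPy sketch of the forward pass for each. The function names, example weights, and the choice of NumPy are illustrative assumptions, not part of the lecture itself.

```python
import numpy as np

def perceptron_predict(x, w):
    # Weighted sum of inputs, then a hard threshold at 0: output +1 or -1.
    z = np.dot(w, x)
    return 1 if z > 0 else -1

def sigmoid(z):
    # Squashes z into (0, 1), interpreted as P(y = 1).
    return 1.0 / (1.0 + np.exp(-z))

def logistic_predict(x, w):
    # Sigmoid activation gives a probability; thresholding at 0.5 gives the class.
    p = sigmoid(np.dot(w, x))
    return (1 if p >= 0.5 else 0), p
```

Note the only structural difference between the two: logistic regression inserts the sigmoid activation between the weighted sum and the threshold, which gives us a probability as an intermediate value.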
And if the probability is higher than 0.5, we say that our prediction y hat is 1. If the probability is lower than 0.5, our prediction y hat is 0. We can also use that intermediate value, the probability that y equals 1 that came out of our activation function, in order to calculate our cost or loss. So our objective in logistic regression, as well as in the perceptron and all of our other models, is to find the values of the weights that minimize this cost function. So let's now walk through the process of training an artificial neuron. Again, our goal in training a neuron is to find the values of the weights that minimize our cost function. As we remember, to minimize a function, we can take the derivative of that function and set it equal to 0. When we covered linear regression models, we could simply take the derivative of our cost function, set it equal to 0, and calculate the weight values that made that equation equal 0. When we introduce nonlinear activation functions such as the sigmoid function that's used in logistic regression, there is no longer an easy way to find a closed-form solution for the weight values that make the derivative equal to 0. Therefore we use an iterative solving method such as gradient descent. We start with some random initial values of our weights, we calculate the cost, and then we slowly move in a direction opposite the gradient or derivative of the cost function, toward the point where we reach a minimum cost, solving for the weights that achieve that minimum value of our cost function. Let's take a look at how we do this using a process called stochastic gradient descent. In stochastic gradient descent, we use one data point at a time. We perform gradient descent, we update our weights, then we take another data point, do the same thing, and we continue on through our entire dataset until we've used every point. Step one in training a neuron using stochastic gradient descent is called forward propagation.
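The iterative idea can be sketched on a toy cost function before we apply it to a neuron. This is a hypothetical one-weight example, not from the lecture: take J(w) = (w - 3)^2, whose derivative is dJ/dw = 2(w - 3) and whose minimum sits at w = 3.

```python
# Gradient descent on the toy cost J(w) = (w - 3)^2.
w = 0.0      # some initial value of the weight
lr = 0.1     # learning rate

for _ in range(100):
    grad = 2 * (w - 3)   # derivative of the cost at the current w
    w = w - lr * grad    # step in the direction opposite the gradient
```

Each step shrinks the distance to the minimum by a constant factor, so w slowly converges toward 3, the value that minimizes the cost.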
In this step, we take our first data point and we forward propagate it through the model. Meaning, we take our data point, we multiply our input features by the coefficients or weights, calculate our z, pass our z through our activation function (the sigmoid function for logistic regression), and we calculate our y hat, or prediction, as an output. Once we've calculated our y hat prediction, we're then able to calculate the cost by comparing that prediction to the actual y value, and also to calculate the gradient of that cost function. We calculate the gradient of the cost function with respect to each of the weights or coefficients that are in our model. Once we've calculated the gradient of the cost function with respect to each weight, we're now able to update the values of each of those weights using our gradient descent process. So our new value for a weight is equal to the previous value of that weight minus our learning rate times the derivative of our cost function with respect to that weight. We can go through each of our weights and update them using this update rule. We then repeat the process by taking the next data point in our dataset, passing it through our model, calculating our y hat, calculating our cost and the derivative of the cost, and then updating the weights one more time. And we continue this process until we've looped through our entire dataset. Eventually our gradient descent process should converge to values of the weights that result in the minimum cost, and these are the weights that we then use in our final model. One of the key parameters that we need to set to enable this process is the learning rate that we saw in the previous update equation. The learning rate controls how big of a step we take each time we perform that gradient descent step. As we'll cover later in our section on neural networks, the learning rate can have a big impact on your ability to train a neural network.
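The full loop described above, forward propagation, gradient, then the update rule applied one data point at a time, can be sketched for logistic regression as follows. This is a minimal NumPy sketch under the standard log-loss assumption, where the gradient with respect to the weights works out to (y hat - y) times x; the function name and default learning rate are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_train(X, y, lr=0.1, epochs=200):
    """Stochastic gradient descent: one observation per weight update."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            y_hat = sigmoid(np.dot(w, xi))   # forward propagation
            grad = (y_hat - yi) * xi         # gradient of the log loss w.r.t. w
            w = w - lr * grad                # update rule: w_new = w_old - lr * grad
    return w
```

The inner line `w = w - lr * grad` is exactly the update rule from the lecture: the previous weight value minus the learning rate times the derivative of the cost with respect to that weight.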
If you set the learning rate too small, each time you perform that gradient descent step, you're taking a very, very small step. And as a result, your algorithm may take a very long time to converge. On the other hand, if you set too large of a learning rate, you end up taking large steps each time, and you may bounce around on your cost function and never find that minimum value. So setting the learning rate to a point where it's neither too small and takes too long, nor too large and runs the risk of diverging, is one of the key things that you need to focus on as you're training neural network models. In the previous example, we trained an artificial neuron using what we called stochastic gradient descent: taking one observation or data point at a time to iteratively calculate the gradient and update the weights, then moving on to the next, and looping through our data one point at a time. This approach works very well for large datasets, and it's also the primary approach used in what we call online learning. That's the case where we have a production model that's receiving data points from a user, for example, one at a time, and each time we receive a data point, we're retraining and updating our model. One of the downsides of stochastic gradient descent is that we have to loop over a single observation at a time, and therefore we cannot take advantage of more efficient vectorized or matrix operations. The alternative approach is what we call batch gradient descent. In batch gradient descent, we use the entire dataset for each update. So we're calculating the gradient and updating the weights based on all of the observations in our training dataset at each iteration. The primary advantage of batch gradient descent is that we can now take advantage of vectorized or matrix operations, and we can perform these operations much more efficiently.
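Here is what that vectorization looks like in a minimal NumPy sketch of batch gradient descent for the same logistic regression neuron. Again the function name and hyperparameter values are illustrative assumptions; the point is that the per-observation Python loop disappears in favor of matrix operations over the whole dataset.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_train(X, y, lr=0.5, epochs=500):
    """Batch gradient descent: every update uses the entire training set."""
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        y_hat = sigmoid(X @ w)         # forward propagate all n points in one matrix op
        grad = X.T @ (y_hat - y) / n   # average log-loss gradient over the dataset
        w = w - lr * grad
    return w
```

Compare this with the stochastic version: one matrix multiply replaces n separate dot products per pass, which is exactly the efficiency gain the lecture describes, at the cost of touching the whole dataset on every single update.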
One of the challenges of batch gradient descent is that if you have a very large dataset, it can sometimes be impractical due to the compute power required to perform batch gradient descent at each iteration. So as a compromise, it's very common that we'll use an approach in training neural networks called mini-batch gradient descent. In mini-batch gradient descent, we divide our training data up into smaller subsets or smaller batches, for example, a batch of eight observations at a time or 32 observations at a time. We can then perform batch gradient descent using all of the observations within this mini-batch each time. Therefore, we're able to take advantage of the vectorized operations that we can accomplish using batch gradient descent, but we're not using nearly as much computational horsepower as if we were to try to use our entire dataset within each batch. Mini-batch gradient descent is very common in training neural networks because it works very well for large datasets, while also still allowing us to achieve efficient computational operations. One of the challenges that you'll find with mini-batch gradient descent is that it's not as good as stochastic gradient descent for online learning, where we often have a single observation coming in at a time from a user and want to retrain as each single observation comes in.
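The compromise can be sketched as follows, again as a minimal NumPy illustration rather than a definitive implementation. The shuffling per pass, the batch size, and the function name are assumptions; the structure is the point: vectorized updates like batch gradient descent, but over small subsets like stochastic gradient descent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_train(X, y, lr=0.5, epochs=200, batch_size=2):
    """Mini-batch gradient descent: vectorized updates on small subsets."""
    rng = np.random.default_rng(0)       # seeded here only for reproducibility
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        idx = rng.permutation(n)         # shuffle the data each pass
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]          # indices of one mini-batch
            y_hat = sigmoid(X[b] @ w)                  # vectorized forward pass on the batch
            grad = X[b].T @ (y_hat - y[b]) / len(b)    # average gradient over the batch
            w = w - lr * grad
    return w
```

Setting `batch_size=1` recovers stochastic gradient descent, and `batch_size=n` recovers batch gradient descent, which is why mini-batch sits between the two approaches in both compute cost and update frequency.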