Next, let's take a look at activation functions and how they help in training deep neural network models. Here's a good example: a graphical representation of a linear model. We have three inputs on the bottom, x1, x2, and x3, shown by the blue circles. Each is combined with a weight, W, given on each of those edges, which are the arrows pointing up, and that produces an output, the green circle at the top. There's often an extra bias term added in, but for simplicity it isn't shown here. This is a linear model, since it's of the form y = w1 * x1 + w2 * x2 + w3 * x3.

Now we can substitute each group of weights with a similar new weight. Does this look familiar? It's exactly the same linear model as before, despite adding a hidden layer of neurons. How is that so? Well, the first neuron in the hidden layer, the one on the left, takes the weights from all three input nodes. Those are the red arrows you see here: w1, w4, and w7, all combining respectively, as clearly highlighted. Its output is then multiplied by a new weight, w10 in our case, as one of the three weights going into the final output. We do this two more times for the other two yellow neurons and their inputs from x1, x2, and x3.

You can see that there's a lot of matrix multiplication going on behind the scenes. Honestly, in my experience, machine learning is largely taking arrays of various dimensionality, 1D, 2D, or 3D, and multiplying them against each other, where one array or tensor could be the randomized starting weights of the model, another is the input data set, and yet a third is the output array or tensor of the hidden layer. Behind the scenes it's mostly simple math, depending on your algorithm, but a lot of it is done really, really quickly. That's the power of machine learning.

Here, though, we still have a linear model. How can we change that? Let's go deeper. I know what you're thinking: what if we just add another hidden layer? Does that make it a deep neural network? Unfortunately, this once again collapses all the way back down into a single weight matrix multiplied by each of the three inputs. It's the same linear model. We could continue adding more and more hidden neuron layers, but we would get the same result, albeit a lot more costly computationally for training and prediction, because it's a much more complicated architecture than we actually need.

So here's an interesting question: how do you escape from having just a linear model? By adding non-linearity, of course. That's the key. The solution is adding a nonlinear transformation layer, which is facilitated by a nonlinear activation function such as sigmoid, tanh, or ReLU. Thinking in terms of the graph as created by TensorFlow, you can imagine each neuron actually having two nodes: the first node is the result of the weighted sum wx + b, and the second node is the result of that weighted sum being passed through the activation function. In other words, they're the inputs to the activation function followed by the outputs of the activation function, so the activation function acts as a transition point between layers, and that's how you get non-linearity.
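To make that collapse concrete, here's a minimal NumPy sketch (the weight values are just random placeholders, not anything from the diagram). It shows that two stacked linear layers are mathematically identical to one linear layer whose weight matrix is the product of the two, while inserting a ReLU between them breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3,))          # inputs x1, x2, x3

W1 = rng.normal(size=(3, 3))       # input -> hidden weights
W2 = rng.normal(size=(3, 1))       # hidden -> output weights

# Two stacked linear layers with no activation...
y_stacked = (x @ W1) @ W2

# ...are exactly one linear layer whose weights are W1 @ W2.
y_collapsed = x @ (W1 @ W2)
print(np.allclose(y_stacked, y_collapsed))   # True -> still a linear model

# Inserting a nonlinear activation between the layers prevents the collapse:
# no single weight matrix reproduces this mapping for all inputs.
relu = lambda z: np.maximum(z, 0.0)
y_nonlinear = relu(x @ W1) @ W2
```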
Adding in this nonlinear transformation is the only way to stop the neural network from condensing back down into a shallow network. Even if you have a layer with nonlinear activation functions in your network, if elsewhere in your network you have two or more consecutive layers with linear activation functions, those can still be collapsed down into just one layer. So usually, neural networks have nonlinear activations for the first n minus 1 layers, and then have the final layer transformation be linear for regression, or a sigmoid or a softmax for classification. It all depends on what you want that final output to be.

Now, you might be thinking, which nonlinear activation function do I use? There are many of them, right? You've got the sigmoid, the scaled and shifted sigmoid, and the tanh, or hyperbolic tangent, which were some of the earliest. However, as we're going to talk about, these can saturate, which leads to what we call the vanishing gradient problem, where, with zero gradients, the model's weights don't update (anything times zero is zero, right?) and training halts.

So the rectified linear unit, or ReLU for short, is one of our favorites because it's simple and it works really well. Let's talk about it a bit. In the positive domain it is linear, as you see here, so we don't have that saturation, whereas in the negative domain the function is 0. Networks with ReLU hidden activations often train as much as 10 times faster than networks with sigmoid hidden activations. However, because the function is always 0 in the negative domain, we can end up with ReLU layers dying. What I mean by that is, you'll start getting inputs in the negative domain, so the output of the activation will be 0, which doesn't help the next layer get its inputs back into the positive domain; it's still going to be 0. This compounds and creates a lot of zero activations. During backpropagation, when updating the weights, since we have to multiply the error's derivative by the activation, we end up with a gradient of zero, and thus a weight update of zero. As you can imagine, with a lot of zeros the weights aren't going to change, and training fails for that layer.

Fortunately, this problem has been encountered a lot in the past, and there are a lot of really clever methods that have been developed to slightly modify the ReLU to avoid the dying ReLU effect and ensure training doesn't stall, while keeping much of the benefit you'd get from a normal ReLU. So here's the normal ReLU again. The maximum operator can also be represented by a piecewise linear equation, where for x less than 0 the function is 0, and for x greater than or equal to 0 the function is x. Some extensions to ReLU are meant to relax the nonlinear output of the function and allow small negative values. Let's take a look at some of those.

The softplus, or smooth ReLU, function has the logistic function as its derivative; the logistic sigmoid function is a smooth approximation of the derivative of the rectifier. Here's another one. The Leaky ReLU function, I love that name, is modified to allow those small negative values when the input is less than 0. Its rectifier allows for a small nonzero gradient when the unit is saturated and not active. The parametric ReLU learns parameters that control the leakiness and shape of the function; it adaptively learns the parameters of the rectifier. Here's another good one.
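Here's a small NumPy sketch of the piecewise definitions we just walked through. The alpha values are only illustrative defaults; in a parametric ReLU, alpha would actually be learned during training rather than fixed like this.

```python
import numpy as np

def relu(x):
    # Piecewise linear: 0 for x < 0, x for x >= 0.
    return np.maximum(x, 0.0)

def softplus(x):
    # Smooth ReLU; its derivative is the logistic sigmoid.
    return np.log1p(np.exp(x))

def leaky_relu(x, alpha=0.01):
    # Small, fixed nonzero slope in the negative domain, so gradients never go fully to zero.
    return np.where(x > 0, x, alpha * x)

def parametric_relu(x, alpha):
    # Same shape as Leaky ReLU, but alpha is a parameter learned during training.
    return np.where(x > 0, x, alpha * x)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))        # [0.  0.  0.  1.5]
print(leaky_relu(z))  # [-0.02  -0.005  0.  1.5]
```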
The Exponential Linear Unit, or ELU, is a generalization of the ReLU that uses a parameterized exponential function to transition from positive to small negative values. Its negative values push the mean of the activations closer to 0, and activations that are closer to 0 enable faster learning because they bring the gradient closer to the natural gradient.

Here's another good one, the Gaussian Error Linear Unit, or GELU. That's another high-performing neural network activation function like the ReLU, but its nonlinearity is the expected transformation of a stochastic regularizer that randomly applies the identity or zero map to that neuron's input. I know what you're thinking: that's a lot of different activation functions. I'm very much a visual person, so here's a quick overlay of a lot of them on that same xy plane.
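If you'd like to reproduce an overlay like that yourself, here's a rough matplotlib sketch. It uses the common tanh approximation of GELU and an illustrative alpha of 1.0 for ELU; the exact curves in the slide may have been drawn with different parameters.

```python
import numpy as np
import matplotlib.pyplot as plt

def elu(x, alpha=1.0):
    # Linear for x >= 0; a parameterized exponential curve toward -alpha for x < 0.
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

def gelu(x):
    # Widely used tanh approximation of GELU: x times an approximate Gaussian CDF of x.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

activations = {
    "ReLU": lambda x: np.maximum(x, 0.0),
    "Leaky ReLU": lambda x: np.where(x > 0, x, 0.01 * x),
    "Softplus": lambda x: np.log1p(np.exp(x)),
    "ELU": elu,
    "GELU": gelu,
}

# Overlay the activation functions on the same xy plane.
x = np.linspace(-4, 4, 400)
for name, fn in activations.items():
    plt.plot(x, fn(x), label=name)
plt.legend()
plt.title("Activation functions")
plt.show()
```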