Now we'll look at another way to estimate w and b for a linear model, called ridge regression. Ridge regression uses the same least squares criterion, but with one difference: during the training phase, it adds a penalty for feature weights, the w_i values, that are too large, as shown in the equation here. Mathematically, large weights mean that the sum of their squared values is large. Once ridge regression has estimated the w and b parameters for the linear model, the prediction of y values for new instances is exactly the same as in least squares: you just plug in your input feature values, the x_i's, and compute the sum of the weighted feature values plus b with the usual linear formula.

So why would something like ridge regression be useful? This addition of a penalty term to a learning algorithm's objective function is called regularization. Regularization is an extremely important concept in machine learning. It's a way to prevent overfitting, and thus improve the likely generalization performance of a model, by restricting the model's possible parameter settings. Usually the effect of this restriction from regularization is to reduce the complexity of the final estimated model.

So how does this work with linear regression? The addition of the sum of squared parameter values, shown in the box, to the least squares objective means that models with larger feature weights w add more to the objective function's overall value. Because our goal is to minimize the overall objective function, the regularization term acts as a penalty on models with lots of large feature weight values. In other words, all things being equal, if ridge regression finds two possible linear models that predict the training data values equally well, it will prefer the one with the smaller overall sum of squared feature weights. The practical effect of using ridge regression is to find the feature weights, w_i, that fit the data well in a least-squares sense, and that set lots of the feature weights to values that are very small. We don't see this effect with a single-variable linear regression example, but for regression problems with dozens or hundreds of features, the accuracy improvement from using regularized linear regression like ridge regression can be significant.

The amount of regularization to apply is controlled by the alpha parameter. Larger alpha means more regularization and simpler linear models with weights closer to zero. The default setting for alpha is 1.0. Notice that setting alpha to zero corresponds to the special case of ordinary least squares linear regression that we saw earlier, which minimizes the total squared error.

In scikit-learn, you use ridge regression by importing the Ridge class from sklearn.linear_model, and then using that estimator object just as you would for least squares. The one difference is that you can specify the amount of the ridge regression regularization penalty, which is called the L2 penalty, using the alpha parameter. Here, we're applying ridge regression to the crime dataset. You'll notice that the results are not that impressive: the R-squared score on the test set is pretty comparable to what we got for least squares regression. However, there's something we can do in applying ridge regression that will improve the results dramatically.
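Here's a minimal sketch of the ridge regression usage just described. The real notebook works with the crime dataset; a synthetic dataset from make_regression stands in for it here, so the scores you see will differ from the lecture's. Ridge minimizes the usual sum of squared errors plus alpha times the sum of squared weights, and alpha is left at the default of 1.0.

```python
# A minimal sketch of ridge regression in scikit-learn; the synthetic dataset
# is only a stand-in for the lecture's crime data.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=30, noise=25.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linridge = Ridge(alpha=1.0).fit(X_train, y_train)  # alpha controls the L2 penalty

print('ridge intercept: {:.3f}'.format(linridge.intercept_))
print('R-squared score (training): {:.3f}'.format(linridge.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'.format(linridge.score(X_test, y_test)))
```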
So now is the time for a brief digression about the need for feature preprocessing and normalization. Let's stop and think for a moment, intuitively, about what ridge regression is doing. It's regularizing the linear regression by imposing that sum-of-squares penalty on the size of the w coefficients. So the effect of increasing alpha is to shrink the w coefficients towards zero and towards each other. But if the input variables, the features, have very different scales, then when this shrinkage of the coefficients happens, input variables with different scales will have different contributions to the L2 penalty, because the L2 penalty is a sum of squares of all the coefficients. So transforming the input features so that they're all on the same scale means the ridge penalty is, in some sense, applied more fairly to all features, without unduly weighting some more than others just because of differences in scale.

More generally, you'll see as we proceed through the course that feature normalization is important to perform for a number of different learning algorithms beyond just regularized regression. This includes k-nearest neighbors, support vector machines, neural networks, and others. The type of feature preprocessing and normalization that's needed can also depend on the data. For now, we're going to apply a widely used form of feature normalization called MinMax scaling, which will transform all the input variables so they're on the same scale, between 0 and 1. To do this, we compute the minimum and maximum values for each feature on the training data, and then apply the MinMax transformation to each feature as shown here.

Here's an example of how it works with two features. Suppose we have one feature, height, whose values fall in a fairly narrow range between 1.5 and 2.5 units, but a second feature, width, has a much wider range between 5 and 10 units. After applying MinMax scaling, values for both features are transformed onto the same scale, with the minimum value getting mapped to 0, the maximum value being transformed to 1, and everything else getting transformed to a value between those two extremes.

To apply MinMax scaling in scikit-learn, you import the MinMaxScaler object from sklearn.preprocessing. To prepare the scaler object for use, you create it and then call its fit method using the training data X_train. This will compute the min and max feature values for each feature in the training dataset. Then, to apply the scaler, you call its transform method, passing in the data you want to rescale; the output will be the scaled version of the input data. In this case we want to scale the training data and save it in a new variable called X_train_scaled, and the test data, saving that into a new variable called X_test_scaled. Then we just use these scaled versions of the feature data instead of the original feature data. Note that it can be more efficient to perform fitting and transforming in a single step on the training set by using the scaler's fit_transform method, as shown here.

There's one last, but very important, point here about how to apply MinMax scaling, or any kind of feature normalization, in a learning scenario with training and test sets. You may have noticed two things here: first, that we're applying the same scaler object to both the training and the test data, and second, that we're fitting the scaler object on the training data and not on the test data. These are both critical aspects of feature normalization. If you don't apply the same scaling to training and test sets, you'll end up with more or less random data skew, which will invalidate your results.
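Here's a minimal sketch of this scaling workflow, again with synthetic stand-in data. The MinMax transformation for each feature is x' = (x - x_min) / (x_max - x_min), and both critical points from above show up in the code: the scaler is fit only on the training data, and the same fitted scaler is then used to transform both splits.

```python
# A minimal sketch of MinMax scaling followed by ridge regression; the
# synthetic dataset again stands in for the crime data.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=30, noise=25.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = MinMaxScaler()
scaler.fit(X_train)                         # compute min/max on the training data only
X_train_scaled = scaler.transform(X_train)  # apply the same fitted scaler to both splits
X_test_scaled = scaler.transform(X_test)
# equivalently, in one step: X_train_scaled = scaler.fit_transform(X_train)

linridge = Ridge(alpha=20.0).fit(X_train_scaled, y_train)
print('R-squared score (test): {:.3f}'.format(linridge.score(X_test_scaled, y_test)))
```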
If you prepare the scaler or other normalization method by showing it the test data instead of the training data, this leads to a phenomenon called data leakage, where the training phase has information that has leaked from the test set, for example the distribution of extreme values for each feature in the test data, which the learner should never have access to during training. This in turn can cause the learning method to give unrealistically good estimates on that same test set. We'll look more at the phenomenon of data leakage later in the course. One downside to performing feature normalization is that the resulting model on the transformed features may be harder to interpret. Again, in the end, the type of feature normalization that's best to apply can depend on the dataset, learning task, and learning algorithm to be used. We'll continue to touch on this issue throughout the course.

Okay, let's return to ridge regression after we've added the code for MinMax scaling of the input features. We can see the significant effect of MinMax scaling on the performance of ridge regression. After the input features have been properly scaled, ridge regression achieves a significantly better model fit, with an R-squared value on the test set of about 0.6, much better than without scaling and much better now than ordinary least squares. In fact, if you apply the same MinMax scaling with ordinary least squares regression, you should find that it doesn't change the outcome at all. In general, regularization works especially well when you have relatively small amounts of training data compared to the number of features in your model. Regularization becomes less important as the amount of training data you have increases.

We can see the effect of varying the amount of regularization on the scaled training and test data, using different settings for alpha, in this example. The best R-squared value on the test set is achieved with an alpha setting of around 20. Significantly larger or smaller values of alpha both lead to significantly worse model fit. This is another illustration of the general relationship between model complexity and test set performance that we saw earlier in this lecture, where there's often an intermediate best value of a model complexity parameter that does not lead to either under- or overfitting.

Another kind of regularized regression that you could use instead of ridge regression is called lasso regression. Like ridge regression, lasso regression adds a regularization penalty term to the ordinary least squares objective that causes the model's w coefficients to shrink towards zero. Lasso regression uses a slightly different regularization term, called an L1 penalty, instead of ridge regression's L2 penalty, as shown here. The L1 penalty looks kind of similar to the L2 penalty in that it computes a sum over the coefficients, but it's a sum over the absolute values of the w coefficients instead of a sum of squares, and the results are noticeably different. With lasso regression, a subset of the coefficients is forced to be precisely zero, which is a kind of automatic feature selection, since with a weight of zero those features are essentially ignored completely in the model. This sparse solution, where only a subset of the most important features are left with non-zero weights, also makes the model easier to interpret in cases where there are more than a few input variables.
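To make the difference between the two penalties concrete, here are the two objectives written out in a standard textbook form; the notation on the slides may differ slightly.

```latex
% Ridge: least squares error plus an L2 (sum of squares) penalty on the weights
RSS_{ridge}(w, b) = \sum_{i=1}^{N} \bigl(y_i - (w \cdot x_i + b)\bigr)^2 + \alpha \sum_{j=1}^{p} w_j^2

% Lasso: least squares error plus an L1 (sum of absolute values) penalty on the weights
RSS_{lasso}(w, b) = \sum_{i=1}^{N} \bigl(y_i - (w \cdot x_i + b)\bigr)^2 + \alpha \sum_{j=1}^{p} |w_j|
```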
Like ridge regression, the amount of regularization for lasso regression is controlled by the parameter alpha, which by default is 1.0. Also like ridge regression, the purpose of using lasso regression is to estimate the w and b model coefficients. Once that's done, the prediction formula is the same as for ordinary least squares: you just use the linear model. In general, lasso regression is most helpful if you think there are only a few variables that have a medium or large effect on the output variable. Otherwise, if there are lots of variables that contribute small or medium effects, ridge regression is typically the better choice.

Let's take a look at lasso regression in scikit-learn using the notebook, with our communities and crime regression dataset. To use lasso regression, you import the Lasso class from sklearn.linear_model, and then just use it as you would use an estimator like ridge regression. With some datasets you may occasionally get a convergence warning, in which case you can set the max_iter parameter to a larger value, typically at least 20,000, or possibly more. Increasing the max_iter parameter will increase the computation time accordingly. In this example, we're applying lasso to a MinMax-scaled version of the crime dataset, as we did for ridge regression. You can see that with alpha set to 2.0, only 20 features with non-zero weights remain, because with lasso regularization most of the features are given weights of exactly zero. I've listed the features with non-zero weights in order of their descending magnitude from the output.

Although we need to be careful in interpreting any results for data on a complex problem like crime, the lasso regression results do help us see some of the strongest relationships between the input variables and outcomes for this particular dataset. For example, looking at the top five features with non-zero weight found by lasso regression, we can see that location factors like the percentage of people in dense housing, which indicates urban areas, and socioeconomic variables like the fraction of vacant houses in an area, are positively correlated with crime, while other variables, like the percentage of families with two parents, are negatively correlated.

Finally, we can see the effect of tuning the regularization parameter alpha for lasso regression. As we saw with ridge regression, there's an optimal range for alpha that gives the best test set performance and neither under- nor overfits. Of course, this best alpha value will be different for different datasets and depends on various other factors, such as the feature preprocessing methods being used.
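Here's a minimal sketch of the lasso workflow described above. A synthetic dataset again stands in for the crime data, so the number of non-zero features and the scores you get here will not match the lecture's output of 20 non-zero weights.

```python
# A minimal sketch of lasso regression on MinMax-scaled features, with a
# synthetic stand-in for the crime dataset.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=30, noise=25.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# max_iter is raised in case of convergence warnings, as mentioned above
linlasso = Lasso(alpha=2.0, max_iter=20000).fit(X_train_scaled, y_train)

print('Features with non-zero weight: {}'.format(np.sum(linlasso.coef_ != 0)))
print('R-squared score (test): {:.3f}'.format(linlasso.score(X_test_scaled, y_test)))

# Non-zero weights listed in order of descending magnitude
for idx in np.argsort(-np.abs(linlasso.coef_)):
    if linlasso.coef_[idx] != 0:
        print('feature {}: {:.3f}'.format(idx, linlasso.coef_[idx]))
```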
Let's suppose for a moment that we had a set of two-dimensional data points with features X0 and X1. Then we could transform each data point by adding additional features that are the three unique multiplicative combinations of X0 and X1: X0 squared, X0 times X1, and X1 squared. So we've transformed our original two-dimensional points into a set of five-dimensional points that rely only on the information in the two-dimensional points. Now we can write a new regression problem that tries to predict the same output variable, y hat, but using these five features instead of two. The critical insight here is that this is still a linear regression problem; the features are just numbers within a weighted sum. So we can use the same least squares techniques to estimate the five model coefficients for these five features that we used in the simpler two-dimensional case.

Now, why would we want to do this kind of transformation? Well, this is called polynomial feature transformation, and we can use it to transform a problem into a higher-dimensional regression space. In effect, adding these extra polynomial features gives us a much richer set of complex functions that we can use to fit to the data. So you can think of this intuitively as allowing polynomials to be fit to the training data instead of simply a straight line, but using the same least squares criterion that minimizes mean squared error. We'll see later that this approach of adding new features like polynomial features is also very effective with classification, and we'll look at this kind of transformation again in kernelized support vector machines. When we add these new polynomial features, we're essentially adding to the model's ability to capture interactions between the different variables, by adding them as features to the linear model. For example, it may be that housing prices vary as a quadratic function of both the lot size that a house sits on and the amount of taxes paid on the property, as a theoretical example. A simple linear model could not capture this non-linear relationship, but by adding non-linear features like polynomials to the linear regression model, we can capture this non-linearity. More generally, we can use other types of non-linear feature transformations beyond just polynomials. This is beyond the scope of this course, but technically these are called non-linear basis functions for regression, and they are widely used. Of course, one side effect of adding lots of new features, especially when we're taking every possible combination of K variables, is that these more complex models have the potential for overfitting. So in practice, polynomial regression is often done with a regularized learning method like ridge regression.

Here's an example of polynomial regression using scikit-learn. There's already a handy class called PolynomialFeatures in the sklearn.preprocessing module that will generate these polynomial features for us. This example shows three regressions on a more complex regression dataset that happens to have some quadratic interactions between variables. The first regression here just uses least squares regression without the polynomial feature transformation. The second regression creates the PolynomialFeatures object with degree set to two, and then calls the fit_transform method of the PolynomialFeatures object on the original X_F1 features to produce the new polynomial transformed features, X_F1_poly; the code then calls ordinary least squares linear regression. You can see indications of overfitting on this expanded feature representation, as the model's R-squared score on the training set is close to 1 but much lower on the test set. So the third regression shows the effect of adding regularization via ridge regression on this expanded feature set. Now the training and test R-squared scores are basically the same, with the test set score of the regularized polynomial regression performing the best of all three regression methods.
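Here's a sketch of those three regressions. The notebook's exact dataset isn't reproduced here; make_friedman1, which also contains non-linear interactions between features, is used as a stand-in, so the scores will differ from the ones described above.

```python
# A sketch of the three regressions described above: plain least squares,
# least squares on degree-2 polynomial features, and ridge regression on the
# same polynomial features.
from sklearn.datasets import make_friedman1
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge

X_F1, y_F1 = make_friedman1(n_samples=100, n_features=7, random_state=0)

# 1. Ordinary least squares on the original features
X_train, X_test, y_train, y_test = train_test_split(X_F1, y_F1, random_state=0)
linreg = LinearRegression().fit(X_train, y_train)
print('OLS          train R2 {:.3f}, test R2 {:.3f}'.format(
    linreg.score(X_train, y_train), linreg.score(X_test, y_test)))

# 2. Ordinary least squares on degree-2 polynomial features (prone to overfitting)
poly = PolynomialFeatures(degree=2)
X_F1_poly = poly.fit_transform(X_F1)
X_train, X_test, y_train, y_test = train_test_split(X_F1_poly, y_F1, random_state=0)
linreg_poly = LinearRegression().fit(X_train, y_train)
print('poly + OLS   train R2 {:.3f}, test R2 {:.3f}'.format(
    linreg_poly.score(X_train, y_train), linreg_poly.score(X_test, y_test)))

# 3. Ridge regression on the same polynomial features (regularization tames overfitting)
linridge_poly = Ridge(alpha=1.0).fit(X_train, y_train)
print('poly + ridge train R2 {:.3f}, test R2 {:.3f}'.format(
    linridge_poly.score(X_train, y_train), linridge_poly.score(X_test, y_test)))
```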
I'd like to take a minute to show you a very useful and interesting plot that shows how the choice of the regularization parameter alpha affects the estimated coefficients of the regression model. I'll call this the coefficient curve plot. So recall that if we want to fit a regularized regression model, such as ridge or lasso regression, to a set of points, we also specify a particular value of alpha that controls how much weight is given to the regularization penalty in the optimization objective. In this part of the slide I've written down the optimization objective, and you can see that the regularization parameter alpha controls how much weight the regularization part of the objective, this sum of squared w_j values, gets compared to the normal least squares part of the objective. So we specify alpha, we fit the model, and as in all regression scenarios, the result of fitting the model will be a set of estimated coefficients. In this example, we're imagining that we have six model weights that go, let's say, with six different features in the input, and we're going to be analyzing a linear model that estimates these six coefficients, W1 through W6, fit with ridge regression in this case. I'm going to ignore the coefficient b here on the plot, since it doesn't contribute to the regularization penalty.

Okay, so here is the coefficient curve plot for this example of ridge regression. On the x-axis over here we have the value of alpha. Note that I'm plotting this on a logarithmic scale, because we're varying alpha quite widely between a relatively small value and a relatively large value, so just take note of that logarithmic scale. On the y-axis of this plot we have the value of an estimated coefficient, and because we have six coefficients, I've plotted six curves on this plot: one curve for each estimated coefficient, W1, W2, and so forth. Now, this chart shows what happens to an estimated coefficient as we fit the model using different choices for alpha. In fact, this coefficient curve plot shows every possible choice for alpha between 10 to the -1 down here, all the way up to 100, and we smoothly pretend that we can compute an infinite number of possible models using any choice of alpha we like, which creates a curve as a function of alpha. For example, the plot shows that for ridge regression, if we pick alpha equal to 10 to the -1, we get a model whose estimate for W1 will be just a little bit less than 50. If we pick alpha equal to 10 and look here at the values, the plot shows that the resulting model will have an estimated coefficient for W1 of around 28, and it shows what the values of the other estimated coefficients will be as well. So you can see that as we increase our choice of value for alpha, the value of the estimated coefficient W1 gets smaller and smaller; it shrinks, and that's basically what ridge regression does. It penalizes coefficient values that are larger, using this squared penalty term, and thus it forces the model to shrink the estimated coefficients towards zero. The more weight you give to that regularization term, that is, the higher the value of alpha, the more you're emphasizing that the estimated coefficients must be small. And if we do this for all six coefficients, this plot shows the shrinkage effect that ridge regression has as we increase alpha.
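Here's a sketch of how a coefficient curve plot like this can be generated, using a small synthetic dataset with six features standing in for W1 through W6; the exact curves will of course differ from the ones on the slide.

```python
# A sketch of a ridge coefficient curve ("solution path") plot: fit one ridge
# model per alpha value and plot each coefficient as a function of alpha.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=60, n_features=6, noise=20.0, random_state=0)

alphas = np.logspace(-1, 2, 100)   # alpha from 10^-1 up to 10^2
ridge_coefs = np.array([Ridge(alpha=a).fit(X, y).coef_ for a in alphas])

plt.plot(alphas, ridge_coefs)      # one curve per coefficient
plt.xscale('log')
plt.xlabel('alpha (log scale)')
plt.ylabel('estimated coefficient value')
plt.title('Ridge coefficient curves')
plt.show()
```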
Here's the corresponding coefficient curve plot for the same linear model, but this time fit using lasso regression, which uses an L1 regularization penalty term. You can see the L1 penalty term here, and just as in ridge regression, the parameter alpha controls how much weight the objective places on the regularization part versus the ordinary least squares part. Notice that when alpha is very small, like 10 to the -1, the values of the resulting estimated coefficients, W1, W2, and so forth, on the left here are effectively the same as we saw in ridge regression, because the regularization penalty is very small in both cases, so we'll get something close to the ordinary least squares solution. So lasso and ridge regression will produce models with estimated coefficients that are very similar when alpha is very small. But as we increase alpha, we can see that this shrinkage effect on the estimated coefficients is very different for lasso regression compared to ridge regression. We can see how lasso regression finds sparse sets of coefficients for higher values of alpha. For example, this time, if we choose alpha equal to 10, the resulting estimated coefficients will be the values along this line, and you can see that the resulting coefficients for W2 here, W3, and W5 have already shrunk to zero. So by the time we get to alpha = 10, the models being produced have W2, W3, and W5 equal to zero, and the only coefficients in the model with non-zero values are the remaining coefficients W1, W4, and W6. Okay, so at alpha = 10, you can see that those are just ever so slightly non-zero. If we choose alpha = 30, which is going to be somewhere here, and fit the model using lasso regression, the only non-zero estimated coefficient at that point will be W1. So you can see why lasso regression gives these very sparse solutions; that's a really nice illustration of the lasso shrinkage effect as a function of alpha. These coefficient curves are a really nice way to show what's called the entire solution path for a model, as a function of the regularization hyperparameter alpha, and that lets us compare how different regularization penalties shrink the estimated model coefficients in different ways.
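And here's the corresponding sketch for lasso, using the same kind of synthetic six-feature dataset; it shows coefficients being driven exactly to zero as alpha grows, though which coefficients drop out, and when, will differ from the slide's example.

```python
# The same kind of coefficient curve plot, but fit with Lasso instead of Ridge,
# illustrating the sparse solution path as alpha increases.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=60, n_features=6, noise=20.0, random_state=0)

alphas = np.logspace(-1, 2, 100)
lasso_coefs = np.array([Lasso(alpha=a, max_iter=20000).fit(X, y).coef_ for a in alphas])

plt.plot(alphas, lasso_coefs)      # one curve per coefficient
plt.xscale('log')
plt.xlabel('alpha (log scale)')
plt.ylabel('estimated coefficient value')
plt.title('Lasso coefficient curves (sparse solution path)')
plt.show()
```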