Welcome to Regularization. In this video, you will learn what regularization is and when you should use it. You will also learn about two widely used regularization techniques: ridge regression and least absolute shrinkage and selection operator (or lasso) regression.

Regularization is a way to handle the problem of overfitting. It reduces the complexity of a model by adding a penalty on the different parameters of the model. After it is applied, the model is less likely to fit the noise in the training data, which improves its ability to generalize. So, regularization is a way of avoiding overfitting by restricting the magnitude of the model coefficients. In machine learning and statistics, a common task is to fit a model to a set of training data and later use that model to make predictions or classify new data points. When the model fits the training data well but does not predict well on new data, you have an overfitting problem. Regularization is a technique used to avoid this problem. The idea behind regularization is that models that overfit the data are complex models, for example, models with too many parameters.

There are a few methods of regularizing linear models, including ridge (or L2) regularization, lasso (or L1) regularization, and elastic net, a regularized regression method that linearly combines the L1 and L2 penalties of the lasso and ridge methods. (You'll learn more about elastic net in the lab.)

Assume the dots in this graph are your data, and you have divided them into two parts. The two red dots are the training data, and the remaining green dots are the test data. You fit a red line through the two red dots using least squares. Because the red line passes exactly through the two red points, the sum of squared residuals on the training data is zero, the minimum possible value. The sum of squared residuals for the green points, which are the test data, is large, however. The red line is overfit on the training data and has high variance on the test data. The main idea behind regularization is to find a new line that doesn't fit the training data quite as well by introducing a small amount of bias. In return for this small amount of bias, you get a significant drop in variance. In other words, by starting with a slightly worse fit, regularization can provide better long-term predictions.

When using least squares to determine the equation ArrDelayMinutes equals intercept plus slope times DepDelayMinutes, the function minimizes the sum of the squared residuals. When using ridge regression to determine the equation, the function minimizes the sum of the squared residuals plus lambda times the weight (the slope) squared. When using lasso regression to determine the equation, the function minimizes the sum of the squared residuals plus lambda times the absolute value of the weight.

In summary, to find the best model, the common method in machine learning is to define a loss (or cost) function that describes how well the model fits the data. The goal is to find the model that minimizes this loss function. The idea is to penalize this loss function by adding a complexity term that gives a bigger loss for more complex models. In the case of polynomials, you can use the squared sum of the polynomial parameters.
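To make those objectives concrete, here is a brief restatement of the three cost functions for this single-predictor example, where the weight is the slope on DepDelayMinutes and lambda controls the strength of the penalty:

```latex
% Least squares: minimize only the sum of squared residuals
\min_{\text{intercept},\,\text{slope}} \sum_{i=1}^{n} \big(\text{ArrDelayMinutes}_i - (\text{intercept} + \text{slope}\cdot\text{DepDelayMinutes}_i)\big)^2

% Ridge regression: add lambda times the squared weight (L2 penalty)
\min_{\text{intercept},\,\text{slope}} \sum_{i=1}^{n} \big(y_i - \hat{y}_i\big)^2 \;+\; \lambda\,\text{slope}^2

% Lasso regression: add lambda times the absolute value of the weight (L1 penalty)
\min_{\text{intercept},\,\text{slope}} \sum_{i=1}^{n} \big(y_i - \hat{y}_i\big)^2 \;+\; \lambda\,\lvert\text{slope}\rvert
```

With more than one predictor, the penalty term becomes the sum of the squared coefficients for ridge or the sum of their absolute values for lasso; the intercept is typically not penalized.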
In the visualization on the previous slides, you can increase or decrease the value of lambda to penalize complex models more or less. For large lambda values, models with high complexity are ruled out, and for small lambda values, models with high training error are ruled out. The optimal solution lies somewhere in the middle.

Lasso is conceptually similar to ridge regression. It also adds a penalty for non-zero coefficients, but unlike ridge regression, which penalizes the sum of squared coefficients (the L2 penalty), lasso penalizes the sum of their absolute values (the L1 penalty). As a result, for high values of lambda, many coefficients are set exactly to zero under lasso, which is never the case in ridge regression.

Let's look at an example of modeling ridge regression with tidymodels (a code sketch follows at the end of this section). First, create a recipe() that includes the model formula. You could do more preprocessing of the data in this step, but the data here is already preprocessed. Next, use the linear_reg() function from the tidymodels library to specify the model. The penalty argument is the value of lambda, and the mixture argument is the proportion of L1 penalty. For ridge regression, specify mixture = 0, which means there is no L1 penalty and only the L2 penalty is used. For lasso regression, specify mixture = 1. Next, create a workflow object so you can more conveniently combine preprocessing, modeling, and post-processing requests. Finally, add the ridge model to the workflow and fit the model.

To view the result of the fitted ridge regression model, use the pull_workflow_fit() function. The results include two columns of interest: the estimate column contains the coefficient estimates learned by the model, and the penalty column contains the value of lambda, which in this example is 0.1. This example focused on ridge regression; to get more information about lasso regression, check out the lab.

In this video, you learned about regularization, a technique you can use to avoid overfitting by restricting the magnitude of model coefficients. It works by adding a penalty on the parameters of the model to reduce its complexity. By adjusting lambda, which controls how strongly the weights are penalized, you can find the model with the right balance.
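Here is a rough sketch of the ridge regression steps just described. It assumes the glmnet engine (not named in the video) and a hypothetical, already preprocessed training data frame called train_data:

```r
library(tidymodels)

# Recipe with the model formula; no further preprocessing steps are added
# because the data is assumed to be already preprocessed
flight_recipe <- recipe(ArrDelayMinutes ~ DepDelayMinutes, data = train_data)

# Model specification: penalty is lambda; mixture = 0 means no L1 penalty,
# so only the L2 (ridge) penalty is used
ridge_spec <- linear_reg(penalty = 0.1, mixture = 0) %>%
  set_engine("glmnet")

# Workflow: combine preprocessing and modeling, then fit on the training data
ridge_wf <- workflow() %>%
  add_recipe(flight_recipe) %>%
  add_model(ridge_spec)

ridge_fit <- fit(ridge_wf, data = train_data)

# View the fitted coefficients (estimate) and the value of lambda (penalty)
ridge_fit %>%
  pull_workflow_fit() %>%
  tidy()
```

For lasso regression, the only change would be mixture = 1 in linear_reg(); the rest of the workflow stays the same.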