Now it's time to introduce our first machine learning algorithm. And this algorithm will be, surprise, surprise, the famous linear regression, an algorithm that you might be well familiar with. We will go over the workings of linear regression by viewing it as a typical machine learning algorithm with a task, a performance measure, and the other attributes of machine learning methods that we introduced earlier. To spice things up a little bit, we will consider the general multivariate case from the start.

First, let's start with the task T. Our task here will be to predict the value of a scalar variable y given a vector of predictors, or features, X_0 to X_{D-1}, which is a vector in the D-dimensional space R^D. Now, to learn from experience, we are given a data set of pairs (X, y), which we can split into two data sets, (X_train, y_train) and (X_test, y_test), as before. Both of these data sets are assumed to be i.i.d. samples from some data-generating distribution P_data.

Next, consider the model architecture that we want to use. By this, we mean defining a space of functions of X that we try to fit to our data. We choose a linear architecture in this case. That is, we constrain ourselves to the class of linear functions of X, so that y_hat = X W. Here, y_hat is a vector of N predicted values, X is a design matrix of dimension N x D, and W is the vector of regression coefficients of length D. Such a vector is also often referred to as a vector of weights in machine learning parlance. Now, one example of such a linear regression model would be a model that tries to predict daily returns of some stock, let's say Amazon stock, while X would be a vector of returns of various market indexes, for example, the S&P 500, the NASDAQ Composite, the VIX, and so on.

Next, we have to specify the performance measure. We will define it as the mean squared error, or MSE for short, calculated on the test set. To compute it, we take the mean value of squared differences between the model predictions y_hat and the true values y over all points in the test data set. The same thing can also be written more compactly as the rescaled squared Euclidean distance between the vectors y_hat and y. Now we can substitute the model equation above in place of y_hat and write it in this form.

The next question is how to find the optimal parameters W. We proceed as we discussed in the last lecture and look for parameters W that minimize another objective, namely the MSE error on the training set, which is the same equation as here but with X_test replaced by X_train and y_test replaced by y_train. The reason is that the mean squared errors on both the train and test data estimate the same quantity, which is the generalization error.

Now, to find the optimal values of W, we need to set the gradient of the MSE train error to zero. This gives us this expression, which should be equal to zero for the optimal vector W. Now, let's omit the unessential constant multiplier 1/N_train here, and write the gradient as the gradient of the norm of this vector, which we write as a scalar product of the vector with itself. Here, the symbol T stands for a transposed vector. Now, let me simplify this relation a bit and omit the superscripts here, because we don't need them; let's just remember that in this calculation we use the training data only. At the next step, we expand the scalar product and take the derivative of this whole expression with respect to W.
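To make the setup concrete, here is a minimal NumPy sketch of the pieces described so far: a linear model y_hat = X W, weights found by setting the gradient of the train-set MSE to zero, and the test-set MSE as the performance measure. The data, sizes, and noise level are made-up placeholders for the stock-returns example on the slides, not part of the lecture material.

```python
# Minimal sketch: linear model y_hat = X @ w, fit by minimizing the train MSE,
# scored by the test MSE. Synthetic data stands in for the lecture's example.
import numpy as np

rng = np.random.default_rng(0)

n_train, n_test, D = 200, 100, 5          # illustrative sizes (assumptions)
w_true = rng.normal(size=D)               # "true" weights, unknown in practice

X_train = rng.normal(size=(n_train, D))   # design matrix, shape (N_train, D)
y_train = X_train @ w_true + 0.1 * rng.normal(size=n_train)
X_test = rng.normal(size=(n_test, D))
y_test = X_test @ w_true + 0.1 * rng.normal(size=n_test)

def mse(y_hat, y):
    """Mean squared error: mean of squared differences between predictions and targets."""
    return np.mean((y_hat - y) ** 2)

# Setting the gradient of the train MSE to zero gives the linear system
# (X^T X) w = X^T y; solving it directly is numerically safer than
# forming the matrix inverse explicitly.
w_hat = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)

print("train MSE:", mse(X_train @ w_hat, y_train))
print("test  MSE:", mse(X_test @ w_hat, y_test))   # estimate of the generalization error
```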
As the first term is quadratic in W, the second is linear in W, and the last doesn't depend on W at all, the last term drops out upon differentiation. As a result, we get this expression, which should be equal to zero for the optimal vector W. Now the optimal value of W is obtained by simply inverting this equation, so we get the final formula W = (X^T X)^{-1} X^T y. This relation is also known as the normal equation in regression analysis. Let's recall that the X and y that stand here refer to the values in the training set. To get a sense of the matrix operations involved here, it's also useful to visualize this relation as shown here.

Now, given the estimated vector W, we can do a few things with it. First, we can compute the in-sample predictions, and from them the training or in-sample error, by multiplying the vector W by X_train from the left. If we substitute the expression for W, we see that the result can be expressed as the product of a matrix H, built from the data matrix X_train, and the vector y_train. This matrix H is sometimes called the hat matrix. You can check that this matrix is a projection matrix: in particular, it's symmetric and idempotent, meaning that its square equals the matrix itself. Another thing to do with the estimated vector W is to use it to make predictions. For example, if we use the test data set to predict out of sample, the answer is given by the product of X_test and the vector W.

Now, note something interesting in this relation. We see that the expression for W includes the inverse of the product of the X transposed and X matrices. If the data matrix X contains columns that are nearly identical or very strongly correlated, that can lead to a situation where this product will be a nearly degenerate matrix. Another common case that leads to a nearly degenerate matrix happens when one column can be expressed as a linear combination of other columns. This is called the multicollinearity phenomenon in statistics. In all such cases, the determinant of this matrix will be numerically very close to zero. This could lead to numerical instabilities or infinities in the predicted values. There are multiple ways to prevent such instabilities in linear regression. One of them is to screen the set of predictors to rule out possible multicollinearity, but there are other methods as well. In particular, in the next video, we will talk about regularization as a way to provide better out-of-sample performance in supervised learning algorithms.
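To make the hat matrix and the multicollinearity issue concrete, here is a small, self-contained NumPy sketch; the data and sizes are invented for illustration, and the condition number of X^T X is used as one convenient numerical proxy for near-degeneracy.

```python
# Sketch: the hat matrix H = X (X^T X)^{-1} X^T and a multicollinearity check.
import numpy as np

rng = np.random.default_rng(1)
n, D = 50, 3                                  # illustrative sizes (assumptions)
X_train = rng.normal(size=(n, D))
y_train = X_train @ rng.normal(size=D) + 0.1 * rng.normal(size=n)

# Hat matrix: maps y_train to the in-sample predictions y_hat = H @ y_train.
XtX_inv = np.linalg.inv(X_train.T @ X_train)
H = X_train @ XtX_inv @ X_train.T
w_hat = XtX_inv @ X_train.T @ y_train

print("symmetric: ", np.allclose(H, H.T))      # projection matrix: H = H^T
print("idempotent:", np.allclose(H @ H, H))    # projection matrix: H @ H = H
print("H @ y equals X @ w_hat:", np.allclose(H @ y_train, X_train @ w_hat))

# Multicollinearity: add a column that is (almost) a linear combination of the
# others; X^T X becomes nearly degenerate and its condition number explodes.
extra = X_train[:, 0] + X_train[:, 1] + 1e-8 * rng.normal(size=n)
X_bad = np.column_stack([X_train, extra])
print("cond(X^T X), original :", np.linalg.cond(X_train.T @ X_train))
print("cond(X^T X), collinear:", np.linalg.cond(X_bad.T @ X_bad))
```

A huge condition number in the second case signals exactly the near-zero determinant discussed above, which is one motivation for the regularization methods covered in the next video.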