We'll start the discussion of linear models with simple linear regression. Many of you are probably familiar with linear regression. It shows up in almost every imaginable field, everything from economics to many branches of the sciences. A linear regression model assumes a linear relationship between the input and output, the inputs being the data features that we've defined and the output being the target we're trying to predict. This relationship is defined by a set of coefficients, which are multipliers of each of the input features. If linear regression is so simple and common, why are we spending time talking about it? There are actually a couple of answers to that question. The first is that even though it is simple, linear regression forms the basis of many of the more complex machine-learning models that we use. In particular, neural networks, which we'll talk about in a later lesson, are really founded on the basis of the simple linear regression. Linear regressions can also be surprisingly effective models in certain situations if they're properly used. They also make a great first model to apply to get a benchmark, or a sense of the performance you might hope to achieve on a particular machine learning task. By the way, I always recommend, when you're working on a modeling task, starting with a simple model like a linear regression. Apply that as a first step and see what performance it gives you. Then once you move on to more complex algorithms, you can compare them back to your original benchmark and see whether you're really making an improvement or not. Finally, one of the really nice things about linear regression is that it's highly interpretable: it's very easy for us to understand the relationships between the inputs and the outputs in the model we're building. How does a simple linear regression model work? Let's take the example we were working on before of predicting sale prices for homes.
If we were building a simple linear regression involving a single variable, the number of bedrooms, we would provide that variable, the bedrooms, into a model and, as the output, we would predict the home price. Our model might look something like this: y = W_0 + W_1x. W_0 is what we call the bias term, or you can think of it as the y-intercept: if all of the features, or in this case the single feature we have, were zero, what would the y value be? We call this, again, the bias. W_1 is the coefficient, sometimes also called the weight, of the variable x, which represents the number of bedrooms in our house. This is the multiplier of that feature used to calculate the total value for our target, the sale price. Let's now move from the simple linear regression model to the multiple linear regression model. In this case, we have more than one feature; in fact, we have as many features as we would like to put into our model. We might add additional features such as the square footage of our home, the school district, or the neighborhood our home is in. Again, we represent this with an equation that contains the bias term W_0, but now we have multiple coefficients, one for each of our input features: y = W_0 + W_1x_1 + W_2x_2 + ... + W_nx_n. W_1 might represent the weight of the number of bedrooms in calculating the final target sale price, W_2 might be the coefficient of the square footage of our home, W_3 might represent the coefficient of the school district we're in, and so on. We add all of these up, the coefficients or weights times the values of the features, to calculate our y value, the target sale price. When we train a linear regression model, what we're really doing is learning the optimal values of these coefficients or weights so that they effectively model the relationship between the input features and the output target.
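To make this concrete, here is a minimal sketch of a multiple linear regression using scikit-learn. The data is entirely made up for illustration (the lesson's actual housing dataset isn't shown); the point is just that after fitting, the learned intercept corresponds to W_0 and the learned coefficients to W_1, W_2, and so on:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: each row is [bedrooms, square_feet],
# and the target is the sale price. These numbers are invented
# purely for illustration.
X = np.array([
    [2, 900],
    [3, 1500],
    [3, 1800],
    [4, 2400],
    [5, 3000],
])
y = np.array([150_000, 220_000, 250_000, 330_000, 400_000])

model = LinearRegression()
model.fit(X, y)

# W_0 is the bias (intercept); W_1 and W_2 are the learned weights
# for bedrooms and square footage.
print("W_0 (bias):", model.intercept_)
print("W_1, W_2 (weights):", model.coef_)

# Predict the sale price of a 4-bedroom, 2000-square-foot home.
print("Prediction:", model.predict(np.array([[4, 2000]]))[0])
```

Because the model is just a weighted sum, each coefficient can be read directly as "the change in predicted price per unit change in that feature," which is what makes linear regression so interpretable.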
The first step in identifying the optimal values of those coefficients or weights is to calculate the total error of the model. We then alter the coefficients in a way that hopefully reduces that total error, to the point where we've minimized it. How do we calculate the total error of our model? Well, the error for any given point, in this case any given house for sale, is the actual price that home sold for minus the predicted sale price. Or in mathematical notation, it's y minus our prediction, which we call y-hat. Alternatively, for computational convenience, we often define error in terms of what we call the Sum of Squared Errors, or SSE. SSE is calculated as the sum, over all the data points we have, of the prediction minus the actual, squared: the sum of (y-hat minus y) squared. When we build our model, what we're really trying to do is find the coefficients that minimize that total value of the Sum of Squared Errors. In modeling terminology, SSE is called our cost function, also called a loss function. In this case, our cost function, the SSE, is the sum of (y-hat minus y) squared over every data point. Again, when we're training our linear regression model, we're seeking the values of those coefficients or weights that minimize the total of our cost function. To do this, we use the training data, the input x's and output y's available to us, and we solve for the weights or coefficients that result in the minimum of the cost function. In the case of linear regression, we can usually do this using a closed-form solution. In other types of models, we apply the same strategy, but often there is no closed-form solution, so we use more complex methods for finding the values that minimize the cost function. Many people think that linear regression only works when there's a linear relationship between the inputs and the outputs.
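The closed-form solution mentioned above is often written as the normal equation, W = (X^T X)^{-1} X^T y, which gives the weights minimizing the SSE directly. Here is a small NumPy sketch of that calculation, using made-up bedrooms-versus-price data for illustration:

```python
import numpy as np

# Made-up data: x = number of bedrooms, y = sale price
# (illustrative values only).
x = np.array([2.0, 3.0, 3.0, 4.0, 5.0])
y = np.array([150_000.0, 220_000.0, 250_000.0, 330_000.0, 400_000.0])

# Prepend a column of ones so the bias W_0 is learned
# alongside the slope W_1.
X = np.column_stack([np.ones_like(x), x])

# Normal equation: solve (X^T X) W = X^T y for W, which
# minimizes the Sum of Squared Errors.
W = np.linalg.solve(X.T @ X, X.T @ y)

# The cost function: SSE = sum over all points of (y_hat - y)^2.
y_hat = X @ W
sse = np.sum((y_hat - y) ** 2)

print("W_0, W_1:", W)
print("SSE:", sse)
```

In practice, libraries use numerically stabler routines (for example, a least-squares solver) rather than inverting X^T X, but the idea is the same: the weights come straight from the training data in one step, with no iterative search.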
In reality, you can also model nonlinear relationships between inputs and outputs. To do that, we transform an input feature with some nonlinear transformation function and create a new feature, which we then use as an input to the model. For example, we may raise an input feature x to a certain power, x squared or x cubed for example, or we may take the log of x, and create that as a new input feature. We feed that into our model, and now we're able to better capture some of the nonlinearities in the relationship between the inputs and the outputs. When we use powers of a feature this way, it's called polynomial regression. There's actually an unlimited number of transformations that we can apply. Let's look at an example of when this comes in handy. In this case, the objective of our modeling task is to predict the fuel efficiency of cars given the horsepower of the engine. You can see here on this slide that I fitted a simple linear regression to horsepower. It looks like it does an okay job at capturing the variability in the pattern that we see in the output, miles per gallon, but there's certainly room for improvement. On this screen I've displayed the Mean Squared Error for the training set and the test set. Now let's take a look at what happens when we use a nonlinear transformation and apply polynomial regression to the same task. In this case, I've taken horsepower cubed and I'm using that as the input to my model. I'm now predicting miles per gallon based on a single input, horsepower cubed. As we can see, our model is doing a much better job at capturing that nonlinear relationship between horsepower and miles per gallon, and as a result, the Mean Squared Error on both our training set and our test set has significantly improved.
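The transform-then-fit idea can be sketched as follows. The lesson's actual horsepower/miles-per-gallon dataset isn't reproduced here, so this uses synthetic data generated from a cubic relationship plus noise; the comparison shows the same effect described above, where a model given horsepower cubed as its input beats one given raw horsepower:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for the horsepower/mpg example: mpg is
# generated from a cubic function of horsepower plus noise.
hp = rng.uniform(50, 230, size=300)
mpg = 45 - 2e-6 * hp**3 + rng.normal(0, 1.5, size=300)

hp_train, hp_test, y_train, y_test = train_test_split(
    hp, mpg, random_state=0)

def fit_and_score(x_train, x_test):
    # Fit a one-feature linear regression and return test MSE.
    m = LinearRegression().fit(x_train.reshape(-1, 1), y_train)
    return mean_squared_error(y_test, m.predict(x_test.reshape(-1, 1)))

mse_raw = fit_and_score(hp_train, hp_test)          # raw horsepower
mse_cubed = fit_and_score(hp_train**3, hp_test**3)  # horsepower cubed

print("Test MSE with horsepower:  ", mse_raw)
print("Test MSE with horsepower^3:", mse_cubed)
```

Note that the model itself is still linear in its coefficients; the nonlinearity lives entirely in the feature we constructed, which is why this still trains exactly like an ordinary linear regression.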