In the last two videos, we learned about the most important idea of machine learning, namely the idea of generalization as the purpose of learning. We found that for a general regression problem, the generalization error is given by the expected square loss, where the expectation is taken over all data, both seen and unseen. Lastly, the no free lunch theorem stated that the generalization error for a given machine learning algorithm is determined by properties of the data-generating process. So the key point here is that we should care about all data, both seen and unseen. But how can we do it if all we have is one particular dataset? By definition, we can't have any unseen data. What we do is split our dataset into two parts. Let's say we have our whole data in the form of a design matrix X. The design matrix stores the data row-wise, so that each record has D features and there are N rows in total. For each record, we have an output y. So a vector y of N outputs is the second part of our data. The key assumption we make here is that all pairs of rows in X and y are i.i.d., that is, independent, identically distributed samples from some unknown data-generating distribution P_data. Now let us split the data matrix X into two data matrices, which we call X_train and X_test, of the same width D and lengths N_train and N_test. In the same way, we split the vector y of length N into two vectors, y_train and y_test, of lengths N_train and N_test. Now, as all entries in the original set of pairs of X and y are i.i.d. samples from some data-generating distribution P_data, the same is true for both sets X_train, y_train and X_test, y_test separately. Now we do a simple trick which is absolutely critical for most machine learning tasks. We simply withhold X_test and y_test from the process of training the model. That is, the model is first trained using only the training set X_train, y_train. In doing this, the model tunes its parameters.
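To make this concrete, here is a minimal sketch of such a train/test split using NumPy. The shapes N and D, the synthetic data, and the 80/20 split fraction are all illustrative choices, not prescribed by the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: N records with D features each, plus one output per record.
N, D = 100, 3
X = rng.normal(size=(N, D))   # design matrix, one record per row
y = rng.normal(size=N)        # vector of N outputs

# Shuffle row indices so the split does not depend on record order,
# then withhold a subset of the pairs (X_i, y_i) as the test set.
idx = rng.permutation(N)
N_train = int(0.8 * N)        # 80/20 split: a common but arbitrary choice
train_idx, test_idx = idx[:N_train], idx[N_train:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
```

Note that shuffling before splitting matters: it keeps the i.i.d. assumption plausible for both halves even if the original rows were stored in some systematic order.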
The goal of training is that the model's predicted values of y match the values y_train as closely as possible. Now, when the model is trained, we use the second set of pairs, X_test and y_test, to estimate the unknown generalization error using its empirical mean. That is, we run the model on the yet-unseen inputs X_test and compare the predictions of the model with the true outputs y_test. Because the model didn't see or use the pairs X_test, y_test in training, this is a fair game. In other words, we use the empirical error as an estimate of the true generalization error, and we assess the performance of the model using the holdout set of pairs X_test and y_test. As long as all data pairs are i.i.d. samples from some true data-generating distribution, and as long as the set X_test, y_test is large enough, such a procedure gives a good estimate of the true and unknown generalization error. Now the main question is how to find the model with the smallest generalization error. To illustrate this process, let's consider this graph. On the x-axis, we plot what is called the model capacity. We won't give an exact definition of this term, but we can think of model capacity as conditions on the model architecture that eventually determine the model's ability to fit arbitrary functions. For example, a restriction to linear architectures means that we can only consider the class of linear regressions. So we have the model capacity on the x-axis, and on the y-axis, we will have either the train error or the test error. First, let's see the behavior of the train error when we increase the model capacity. As we see, this is typically a monotonically decreasing function of model capacity. This is because by making a model progressively more complex, we can eventually fit any given training dataset arbitrarily well. So the training error always decreases with increased complexity of the model. Now, let us see a typical behavior of the test error.
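The empirical estimate of the generalization error described above can be sketched in a few lines. This assumes, purely for illustration, a simple linear ground truth y = 2x plus Gaussian noise, and a linear model fit by least squares.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative 1-D regression data: y = 2x + noise (assumed ground truth).
x_train = rng.uniform(-1, 1, size=80)
y_train = 2.0 * x_train + rng.normal(scale=0.1, size=80)
x_test = rng.uniform(-1, 1, size=20)
y_test = 2.0 * x_test + rng.normal(scale=0.1, size=20)

# Fit a line using the training pairs only; the test pairs are withheld.
model = np.poly1d(np.polyfit(x_train, y_train, deg=1))

# Empirical generalization error: the mean squared loss on the held-out set.
test_mse = np.mean((model(x_test) - y_test) ** 2)
print(f"estimated generalization error (test MSE): {test_mse:.4f}")
```

The key point is the last two lines: the expectation over unseen data is replaced by an average over the holdout set, which is a valid estimate precisely because those pairs were never used in fitting.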
Here, we see something different. Initially, the test error drops roughly in sync with the drop of the train error. But then it typically reaches some minimum point, after which it starts to grow. This behavior, when the test error starts to grow at a certain level of model complexity, is called over-fitting. A proper balance in this example is attained at the point where the test error is at its minimum. To the left of this point, our model under-fits because it's too simple, and it has high bias as a result. To the right of this point, our model over-fits because it's too complex, leading to high variance. The position and the height of the optimal point are properties of the data. In particular, the test error picks up the irreducible noise in the data itself, as we remember from the bias-variance decomposition. In other words, optimal model complexity needs to match data complexity in order to avoid either under-fitting or over-fitting. We can visualize this in the following diagram. Here, I plot the model capacity on the x-axis and the data complexity on the y-axis. We can have four scenarios in this plane. First, the simplest case is when we have simple data and a simple model, for example, linear regression with a few predictors. Such linear regression models can only approximate linear functions well, and therefore typically have a low capacity. The opposite scenario arises when simple models are applied to complex data. This would typically be a strong candidate for under-fitting. The other case happens when highly complex data is matched in complexity by a complex model. This is often the case in large-scale machine learning tasks, such as image or speech recognition. Finally, there is the most dangerous quadrant in this plane, which I show in red. This is the scenario when you over-fit. That is, your model is way too complex for the data. Accordingly, your test error starts to send you signals that your model indeed over-fits.
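The train and test error curves from the graph can be reproduced numerically. In this sketch, capacity is represented by polynomial degree, and the data is assumed (for illustration only) to come from a noisy cubic, so the "data complexity" is degree 3.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data from a noisy cubic; the true data complexity is degree 3.
def f(x):
    return x**3 - x

x_train = rng.uniform(-2, 2, size=30)
y_train = f(x_train) + rng.normal(scale=0.5, size=30)
x_test = rng.uniform(-2, 2, size=200)
y_test = f(x_test) + rng.normal(scale=0.5, size=200)

train_err, test_err = {}, {}
for deg in range(1, 13):            # model capacity = polynomial degree
    model = np.poly1d(np.polyfit(x_train, y_train, deg))
    train_err[deg] = np.mean((model(x_train) - y_train) ** 2)
    test_err[deg] = np.mean((model(x_test) - y_test) ** 2)

# Train error keeps falling with capacity; test error is typically U-shaped,
# with its minimum near the true data complexity.
best = min(test_err, key=test_err.get)
print("degree with lowest test error:", best)
```

Running this shows the pattern from the lecture: the training error decreases monotonically with degree, while the test error first drops and then climbs once the model starts fitting the noise.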
You may then want to reduce the model capacity, and there are multiple ways to reduce model capacity in machine learning. If you use linear regression, then one simple way to do this would be to reduce the number of predictors or features. But it turns out that there are many other ways to constrain model capacity, such as those briefly mentioned in the box on the right. These are topics that we will discuss at length in our second course in this specialization. But for now, it's time to move on and consider our first machine learning algorithm.