Welcome to this course on Machine Learning Algorithms. My name is Denis Batalov. I've been with Amazon for 13 years and currently work as a principal solutions architect specializing in machine learning, and I even have a PhD in this field. You are now ready to start diving into machine learning algorithms, exciting times. Many customers struggle with understanding how to translate their business problems into an IT solution that somehow incorporates machine learning. So in this course, we're going to review the different types of machine learning algorithms and the problems they solve. Perhaps you've already heard about supervised learning, unsupervised learning, reinforcement learning, and deep learning. By the end of this course, you should be able to speak confidently about these categories of ML algorithms with your customers and help them determine the category that fits their problem. So let's get going.

Before we can intelligently speak of machine learning, let's recall what artificial intelligence, or machine intelligence, is all about. A system exhibiting intelligent behavior normally needs to possess two fundamental faculties. First is the ability to acquire and systematize knowledge. This relies on so-called inductive reasoning, or coming up with rules that would explain the individual observations. Of course, simple facts or truths need to be acquired as well, but if a rule can be learned so that many truths can be derived from it, it would be easier to remember a single rule, wouldn't it? For example, as you hear me speak, you don't need to constantly remind yourself that there is no live person in front of you, because you know how video recordings work. The second faculty is inference, or the ability to use the acquired knowledge to derive truths when needed: making predictions, choosing actions, or making complex plans. This ability relies on deductive reasoning, which was popularized so much by Conan Doyle. You heard me use the terms learning and prediction when describing these faculties, and this is of course no accident. All machine learning algorithms must possess them in some form. Early algorithms in the AI space relied primarily on the second faculty of inference, by having humans acquire and feed all the necessary knowledge into the machine. This unfortunately proved to be impossible for most practical problems, and that's why ML algorithms rule the day: machines learn automatically.

We're now ready to discuss the different categories of ML based on how the machine learns and what it can infer. Currently, when people think of machine learning, they typically think of supervised learning because of its wide applicability and many successful applications. It's called supervised because there needs to be a supervisor, a teacher or trainer, showing the right answers during learning. No wonder we also call it training a machine learning model. A model, because the algorithm is effectively able to simulate, or model, the teacher. Oftentimes, the teacher is simply not there and all we're left with is just observations, or data. Can something useful be learned from the data in such a case? You guessed it. This is the domain of so-called unsupervised learning. One typical example from this category is a clustering algorithm, which divides the observations into what appear to be different clusters. We will see others later. I should point out that there exist so-called mixed or semi-supervised algorithms, but let's not overcomplicate things for now.
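To make the distinction concrete, here is a minimal sketch in Python using scikit-learn; the library, the tiny dataset, and the labels are my own illustrative choices, not something prescribed by the course.

```python
# A minimal sketch of supervised vs. unsupervised learning with scikit-learn.
# The tiny dataset below is invented purely for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Six 2-D observations (two measurements per sample).
X = np.array([[1.0, 1.2], [0.9, 1.0], [1.1, 0.8],
              [4.0, 4.2], [3.9, 4.1], [4.2, 3.8]])

# Supervised learning: a teacher supplies the right answers (labels).
y = np.array([0, 0, 0, 1, 1, 1])
classifier = LogisticRegression().fit(X, y)
print(classifier.predict([[1.0, 1.1], [4.1, 4.0]]))  # expected to predict [0 1]

# Unsupervised learning: no labels, the algorithm groups the data on its own.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(clusters)  # e.g., [0 0 0 1 1 1] (cluster numbering is arbitrary)
```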
Another kind of learning that has been gaining in popularity recently is so-called reinforcement learning. In some sense, this type of algorithm is attempting to solve the complete AI problem of building an agent capable of exhibiting entire intelligent behaviors, not just making isolated decisions. This is why it's an exciting area of research, but that's also what makes it difficult in applied, practical settings. In reinforcement learning, the agent controlled by the algorithm interacts with a possibly completely unknown environment and learns optimal actions via trial and error. Here, there's no explicit teacher telling the agent what the right action is at any given time. Instead, the agent receives an often delayed reward or penalty, called reinforcement, and is designed to maximize long-term rewards. Think of playing a computer game with possibly unknown rules, where your goal is to get maximum points. Not surprisingly, this approach has been rather popular in game play, from early successes with simple games like tic-tac-toe in the 1960s, to backgammon in the 1990s, to the very recent and highly publicized triumph of ML over the game of Go.

Now, let's look closer at supervised learning. Suppose we want a machine learning algorithm to distinguish between circles and squares. A supervised learning algorithm would require many examples of both figures and a teacher who would tell it which is which. After the training is finished, a successful learning algorithm would be able to decide on its own whether any given figure is a circle or a square with sufficient accuracy, hopefully substantially better than random guessing. It would do so even for circles or squares that it has never seen during training, and this is ultimately the power of supervised learning. Does the correct answer always need to come from a human teacher? The notion of a teacher may be generalized to any complex system or phenomenon consisting of machines, humans, or natural processes. Here, you see a Rube Goldberg machine representing this complex system. You can view this system as a function that accepts input parameters and produces an outcome of sorts. The known outcomes form the so-called ground truth, and this set of historic observations is our training dataset. We then train the machine learning model by feeding it this training dataset. The resulting model is expected to predict the same outcome based on previously unseen input parameters. Hopefully, the model's prediction is the same as, or close to, what the original system would have produced. The reason we're interested in building such models is that the original system is either impossible or expensive to procure and scale, or takes too long to produce the outcome, which we want to obtain sooner.

If the predicted value is of a binary nature, as was the case with circles and squares, we say that the model is performing binary classification, in other words, labeling the observation in one of two possible ways. This is just a special case of multi-class prediction, where the data point can belong to one of many different and mutually exclusive classes, such as circles, squares, or triangles. Mutual exclusivity is tricky to assure, though, as this picture makes clear. Sometimes it's entirely a matter of perspective. Now, if the variable being predicted is numeric, then the model is said to be solving a regression problem, in other words, determining the unknown value of the dependent variable based on the input parameters. Let us now look at some examples.
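Before the real-world examples, here is a minimal sketch, again using scikit-learn as an illustrative stand-in (the course does not prescribe any particular library), of what multi-class classification and regression look like in code; the synthetic data is my own.

```python
# A minimal sketch of classification (predicting a class) versus regression
# (predicting a number) with scikit-learn, on purely synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))            # 200 samples, 2 input parameters

# Multi-class classification: each sample belongs to exactly one of 3 classes.
y_class = (X[:, 0] + X[:, 1] > 10).astype(int) + (X[:, 0] > 8).astype(int)
clf = LogisticRegression(max_iter=1000).fit(X, y_class)
print(clf.predict([[2.0, 3.0], [9.5, 9.5]]))     # predicted class labels

# Regression: the target is a numeric value of the dependent variable.
y_reg = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, size=200)
reg = LinearRegression().fit(X, y_reg)
print(reg.predict([[2.0, 3.0]]))                 # a continuous number, roughly 0
```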
For instance, we might have historical records of volcano eruptions and of various observations and measurements leading up to them. We can imagine training a machine learning model capable of predicting future eruptions. The teacher, or the source of truth in this case, is simply nature itself. Another example is when we want the model to predict impending equipment failure so that we can make proactive repairs; this is known as predictive maintenance. Similarly, we may want to build a model that predicts which of our clients are about to stop purchasing our services, possibly leaving for a competitor; this is known as customer churn prediction. In these examples, all we need to do to obtain a good training dataset with properly labeled observations is to systematically record historical observations, together with the observed values of the variables we ultimately want to predict. You will eventually know when a volcano erupts or when your equipment breaks down. That is only possible if the real system we're trying to model is already functioning on a regular basis and is easy to observe. If, say, we want to train a model to label people in photos as either smiling or frowning, then we would first need to have someone go through a large number of photos and label them manually. If such a human labeling process were not already in place, obtaining a training dataset could be difficult and time-consuming. Fortunately, tools to crowdsource human decisions are available, such as Amazon Mechanical Turk or similar offerings from AWS partners such as Figure Eight and others.

So how many different supervised learning algorithms are there? There are literally hundreds of them in existence, so there's no point in mentioning them all. Instead, we can focus on a few families of algorithms that are most popular and have proven to be successful. One of the earliest and simplest algorithms is based on learning the parameters of a linear function, likely in a multi-dimensional space. You've already seen an example of regression where we find the linear function that best fits the data. When it comes to predicting a category, or a class as in circles or squares, we typically want to find a hyperplane, also known as a decision boundary, that best separates the data samples belonging to the classes, as shown in this picture. If there exists a linear surface that separates the two classes, we say that they are linearly separable. This is rarely the case in practice, however, so some errors must be expected. To arrive at a binary classifier, we can apply a logistic function to the output of a linear combination of the input parameters in order to restrict the values to the range from zero to one. This forms the basis of the so-called logistic regression algorithm. In fact, Amazon SageMaker has a built-in algorithm called Linear Learner, which is effectively a combination of linear and logistic regression. But other linear algorithms exist as well. You may have heard about Support Vector Machines, or SVMs, which strive to find a hyperplane with maximum margin between the classes. More modern variants of the algorithm also introduce non-linearity with kernel functions, so strictly speaking they would not belong to pure linear methods anymore. The perceptron is another rather simple linear classifier that forms the foundational unit of so-called artificial neural networks, which we'll look at later in this course. As I pointed out earlier, in most practical settings we're not dealing with linearly separable classes, as demonstrated here.
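To see why linear separability matters, here is an illustrative sketch on scikit-learn's synthetic make_circles dataset (my own choice of example, not one from the course): a logistic regression can only draw a straight decision boundary and barely beats guessing, while a decision tree, which builds axis-aligned boundaries, separates the two rings well. This previews the tree-based methods discussed next.

```python
# Illustrative sketch: a linear classifier vs. a non-linear one on data
# that is not linearly separable (two concentric rings of points).
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_circles(n_samples=500, factor=0.5, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Logistic regression can only draw a straight decision boundary,
# so it does little better than guessing here (accuracy around 0.5).
linear = LogisticRegression().fit(X_train, y_train)
print("logistic regression:", linear.score(X_test, y_test))

# A decision tree builds an axis-aligned, box-like boundary and
# separates the inner ring from the outer one quite well.
tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)
print("decision tree:      ", tree.score(X_test, y_test))
```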
A circular decision boundary would work here, but so would a square-shaped one aligned with the coordinate axes, shown here. This is exactly the decision boundary used by algorithms that end up constructing the so-called decision trees. In order to make a classification, we start with the root of the tree and descend through the decision nodes until we arrive at a classification. In this example, points with an x-coordinate outside of the range from X1 to X2 are immediately classified as red, but for those in the range, we need to additionally consult the y-coordinate and check against the Y1 and Y2 values. Instead of constructing a single tree, the algorithms from the tree family often construct many trees and combine their predictions in some way. Algorithms such as Random Forest and XGBoost are based on these approaches. In fact, Amazon SageMaker includes the XGBoost algorithm. It is based on the idea of building a strong classifier out of many weak classifiers in the form of decision trees. Such an approach is called boosting. XGBoost is a general-purpose supervised algorithm. The Factorization Machines algorithm, on the other hand, works best when we deal with large amounts of sparse data, as is the case with the problem of click prediction for online advertising, or recommendations in general. Factorization Machines is also built into SageMaker. As usual, many other approaches and algorithms exist as well. For example, we could define the decision boundary to be polynomial, as in circular or parabolic boundaries. Of course, we'll come back to neural networks later in this course.

Let us now examine the unsupervised learning algorithms. Clustering is an especially popular type of unsupervised algorithm. Given a collection of data points, we're trying to divide them into groups, or clusters, with the assumption that points belonging to the same cluster are somehow similar, whereas those belonging to different clusters are somehow dissimilar. We're still required to give some guidance to the algorithm, such as specifying the number of clusters we're looking for. One problem with clustering algorithms is that we usually don't know how many clusters to pick. Here's an example of the result if we request just two clusters. But depending on various parameters and distance measures, a differently tuned algorithm might provide a different answer for the same two clusters requested. If for this dataset we request four clusters instead, we might get something that looks like this. This points to another problem with such algorithms: it is ultimately up to us how to interpret the results and assign meaning to the discovered clusters.

Another, entirely different family of unsupervised algorithms attempts to detect anomalies, or generally find outliers in the data. In this picture, we see the red, green, and blue lines representing different sensor readouts of an electrocardiogram, while the top magenta line corresponds to the anomaly scores produced by the algorithm after observing the data. The higher the score, the more pronounced the anomaly is. There's no explicit teacher labeling the historic data as anomalous; instead, the algorithm learns on its own what normal looks like by simply observing the data. One anomaly detection algorithm was developed by scientists working at Amazon, so it's worth taking a closer look. It's called Random Cut Forest. The algorithm works by constructing a forest of the so-called random cut trees.
Each tree is constructed by a recursive procedure which first surrounds the data points with a bounding box and then cuts, or splits, it along the coordinate axes by picking cut points randomly. The procedure is repeated until every point is sorted into a particular leaf of the tree. For full details, you can read the paper presented at the International Conference on Machine Learning in 2016.

Yet another example of an unsupervised algorithm is the so-called topic modeling for documents with text content. The algorithm is the basis of the eponymous feature in the Amazon Comprehend service. Given a collection of documents, news articles for example, and the number of topics we would like to discover, the algorithm produces the top words that appear to define each topic, together with the weight that each of these words has in relation to the topic. In this case, you see top words that likely pertain to sports. As with clustering in general, the approach is sensitive to the number of topics requested, and it still requires us to assign meaning to the discovered topics, such as health, sports, or politics.

To summarize, Amazon SageMaker includes a popular clustering algorithm called k-means. It's an Amazon improvement over the well-known and scalable algorithm called web-scale k-means. Another member of the unsupervised family is called Principal Component Analysis, or PCA for short, likewise available in SageMaker. It's especially useful in reducing the dimensionality of the dataset and is often used as a feature engineering step before passing the data to a supervised algorithm. Latent Dirichlet Allocation, or LDA, is the name of a particular topic modeling algorithm. A variant is used by the topic modeling feature of Amazon Comprehend, and the algorithm is also available in SageMaker. The Random Cut Forest algorithm for anomaly detection is available in SageMaker as well as in Amazon Kinesis Data Analytics, for easy application to streaming data. Kinesis Data Analytics also features hotspot detection, another example of an unsupervised learning algorithm, which you can use to identify relatively dense regions in your data.

Okay, time for a quick quiz. Suppose we have the problem of predicting future values of time series data. For example, suppose that we want to predict the future sales of some item. We have observed historical daily sales figures up to today and now want to predict what the sales figures will be in the future. Is this a supervised or unsupervised learning task? At first we might be tempted to answer that it is unsupervised, similar to anomaly detection, because there does not appear to be a teacher anywhere. But this is a bit of a trick question. You see, all observations of historical sales leading up to a particular day D in the past can be viewed as a training sample, or observation, with the correct label being the actual recorded sales for day D, eight in this case. In other words, just like with the historical volcano eruptions, the teacher here is the external environment.

Finally, it's time for us to look at deep learning, which is really a resurgence of neural networks. To understand them, let's first look at an element of neural networks called a neuron. In the neuron you see in this diagram, a data sample is represented as a vector of numeric input values, which are then linearly combined with the neuron's weights. In other words, the neuron computes a weighted sum and then applies a so-called activation function to produce an output in the range from zero to one.
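As a rough sketch of that computation, here is a single artificial neuron written in plain NumPy; the weights, bias, and inputs are arbitrary values chosen only to show the mechanics, with a sigmoid as the activation function.

```python
# A single artificial neuron: a weighted sum of the inputs followed by
# a sigmoid activation, producing an output between zero and one.
# Weights, bias, and inputs below are arbitrary illustrative values.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    return sigmoid(np.dot(w, x) + b)   # weighted sum, then activation

x = np.array([0.5, -1.2, 3.0])         # one data sample with three input values
w = np.array([0.8, 0.1, -0.4])         # the neuron's learned weights
b = 0.2                                # bias term

print(neuron(x, w, b))                 # a value between 0 and 1, roughly 0.33
```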
With proper thresholding, this can work as a binary classifier. Remember the perceptron that I mentioned earlier in this course? This is effectively what a perceptron is. However, a single neuron would not be sufficient for practical classification needs. Instead, we can combine neurons into fully connected layers to produce so-called artificial neural networks, also known as multilayer perceptrons. In a feedforward pass, the network turns the input values into an output, which forms the prediction of the algorithm. A special technique called backpropagation is then used to reduce the error between the desired, or true, output and the actual one produced by the network. Originally, these neural networks were inspired by some aspects of the biological nervous system, but at this point they are really a computational apparatus for complex dependency modeling and function approximation.

What I have shown so far is an example of a traditional neural network prior to the advent of deep learning. In the last several years, we have seen a resurgence of neural networks, rebranded as deep learning, due to several important advances: advances that relate to the algorithms themselves, the accumulation of large amounts of data for training, and the emergence of powerful specialized hardware such as GPUs, which are able to crunch this massive amount of data by passing it through networks that are very deep in terms of the sheer number of layers. Some of the results proved rather spectacular, enabling many exciting applications such as image and speech recognition, natural language processing, and so on. So how deep is deep? Here's an example that is rather puny by modern standards. In fact, networks with over 1,000 layers have been experimented with. Such networks have billions of parameters, and many millions of images could be used in training. The sheer computational power required to train such networks is not cheap to procure, and this is where AWS comes in handy with GPU-based EC2 instances housing powerful chipsets such as NVIDIA Volta in the P3 family. More importantly, you can distribute the training across multiple GPUs in order to speed it up, and AWS makes it rather economical to set up a hardware cluster just for the duration of training, without having to worry about expensive hardware sitting idle afterwards.

One important breakthrough in deep learning was the invention of the so-called Convolutional Neural Networks, or CNNs for short, which are especially useful for image processing. The main idea behind CNNs is that they are able to relate nearby pixels in the image instead of treating them as completely independent inputs, which was the case prior to CNNs. A special operation called convolution is applied to entire subsections of the image, and, more importantly, the parameters of these convolutions are also being learned in the process. If several convolutional layers are stacked one after another, each convolutional layer learns to recognize patterns of increasing complexity as we move through the layers. We don't have time in this course to dive into the details of how CNNs function; our goal is to understand some of the common use cases. Of course, recognizing objects in images, and generally classifying images, is a very common use case. But CNNs have enabled many other exciting applications related to images. For example, they can be used for semantic segmentation, or classification of individual pixels as belonging or not belonging to detected objects, a motorcyclist in this case.
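To make the convolution operation a little more concrete, here is a minimal NumPy sketch of sliding a small hand-picked filter over a tiny synthetic image; in a real CNN the filter values are learned during training, and frameworks apply many such filters across many layers.

```python
# A minimal sketch of the convolution operation: slide a small filter over
# an image and compute a weighted sum of each neighborhood of pixels.
# The 6x6 "image" and the 3x3 filter below are hand-picked for illustration;
# in a CNN, the filter values themselves are learned during training.
import numpy as np

image = np.zeros((6, 6))
image[:, 3:] = 1.0                     # left half dark, right half bright

# A simple vertical-edge filter: responds where pixel values change left to right.
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

out = np.zeros((4, 4))                 # "valid" output size: (6-3+1) x (6-3+1)
for i in range(4):
    for j in range(4):
        patch = image[i:i + 3, j:j + 3]
        out[i, j] = np.sum(patch * kernel)

print(out)                             # large values mark the vertical edge in the image
```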
Furthermore, they have been used for other novel applications such as artistic style transfer, where one image, here a photo of a cat, is modified by applying an artistic style to it which was previously extracted from another image, typically a painting. In the bottom-right corner, they have even been used to generate photos of cats. These cats look photorealistic, and yet none of them ever existed in real life.

If we take the output of a neuron and feed it as input to itself or to neurons from previous layers, we are creating so-called recurrent neural networks. It's as if the neuron remembers its output from the previous iteration, thus creating a kind of memory. On the right-hand side you see just one unit of a more complex network called LSTM, which stands for Long Short-Term Memory. It is commonly used for speech recognition and translation. In fact, LSTMs are used as a building block for so-called sequence-to-sequence modeling, which is used in neural machine translation. The high-level architecture in the diagram shows how the input "das grüne Haus" is being translated into "the green house." Amazon released an entire library called Sockeye for state-of-the-art sequence-to-sequence modeling tasks that customers can use in their projects. We just looked at convolutional and recurrent neural networks, which are the two most common families of neural networks, but as you can see here, many other network topologies exist and are being studied. Amazon SageMaker conveniently provides a built-in algorithm for image classification based on ResNet, a kind of CNN, but it also provides a sequence-to-sequence algorithm, a neural topic modeling algorithm to complement Latent Dirichlet Allocation, and also the DeepAR forecasting algorithm for time series prediction, which we already looked at. Remember the quiz? So are deep learning algorithms supervised or unsupervised? Well, they can be either. The algorithms shown on the slide are all supervised except for neural topic modeling, as the icons on the left indicate. Deep learning algorithms have even been employed as a key component of reinforcement learning algorithms.

Well, this concludes our review of various machine learning algorithms. Hopefully, you've come to understand the different categories of machine learning algorithms and how they relate to the business problems they help solve. Thanks for listening. You can follow me on Twitter, and please tune into other courses in this series.