In this module, we will explore clustering machine learning algorithms. We'll discuss what they're used for, how they work, and the intuition behind how they work. We'll focus on how they can help us solve business problems and gain insight into them. Finally, we'll use R and RStudio to apply the algorithms to realistic data and provide insight on a business problem.

Machine learning has many definitions, but it generally refers to solving a problem by gathering data and then using a machine to follow an algorithm that builds a statistical model from the data, in order to gain actionable information. Put another way, a machine learning algorithm is like a program with instructions. When you apply the algorithm to data, it creates a model. The model takes in new data and carries the instructions for how to make predictions with that new data.

In this module, we're focusing on clustering algorithms. These algorithms are frequently called unsupervised learning because they're able to work with unlabeled data, that is, data that have no dependent variable or target variable of interest. Thus, clustering algorithms are different from other algorithms such as regression, decision trees, and k-nearest neighbors because there's no truth that you're hoping to predict. Rather than predicting the future or determining the effect of key variables on an outcome, you're exploring the data, learning from it, and seeing how it all fits together in order to solve a business problem or gain actionable business insight.

Like most of machine learning, clustering algorithms come into the data analytics project workflow toward the second half of the process, in the data modeling stage and sometimes the data exploration stage. By that point, the business analyst has acquired, cleaned, and explored the data and is now using machine learning to extract business insights.
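The idea described above, that applying an algorithm to unlabeled data creates a model, can be sketched in a few lines of base R. This is only an illustration under stated assumptions: it uses k-means as one example of a clustering algorithm, R's built-in `iris` dataset with its label column dropped to mimic unlabeled data, and an arbitrary choice of three clusters. None of these choices come from the module itself.

```r
# Illustrative sketch only: k-means is one of several clustering algorithms,
# and the dataset, k = 3, and the seed are assumptions made for this example.

# Keep only the four numeric measurements (dropping the Species label) so the
# data are unlabeled, then standardize so no single variable dominates distances.
unlabeled <- scale(iris[, 1:4])

set.seed(42)  # k-means starts from random centers; fix the seed for repeatability

# Applying the algorithm (kmeans) to the data produces a model (fit).
fit <- kmeans(unlabeled, centers = 3, nstart = 25)

# The model holds one cluster assignment per row and one center per cluster.
table(fit$cluster)  # how many observations fell into each cluster
fit$centers         # cluster centers, in standardized units
```

Note that nothing in the code tells the algorithm which group each row "should" belong to. The clusters come entirely from how the observations sit relative to one another.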
Thus, clustering algorithms might be an end unto themselves, but they also might be input for future machine learning. That is, clustering algorithms might be used to find labels for later classification algorithms and other supervised learning. Thus, both types of algorithms can potentially work together. Learning in which both supervised and unsupervised learning occur is called semi-supervised learning.

Regression and classification algorithms have a general procedure that they follow. While this is generally true for unsupervised learning as well, there are some differences. Certainly, all data needs to be extracted, transformed, loaded, and scrubbed to some extent, and algorithm selection still needs to happen, of course. What's clearly different between supervised and unsupervised learning is that model training and model fitting, in the supervised sense, do not make sense for clustering algorithms. While the model will learn from the data and lead to important conclusions, there's no truth in unsupervised learning.

Model evaluation is also different. While trained classification models can be compared against holdout test data for accuracy, clustering algorithms are evaluated much more on whether they make sense and lead to valuable insight. Thus, overall, clustering algorithms are more subjective. Of course, if you love data, this just means that they're even more fun. Fewer rules, less supervision: it's everything your rebellious teenager or data-crazed analyst could ever want.

Of course, as we introduce each of the machine learning algorithms that we'll use, we'll discuss the specific business problem that it is going to help us solve. Clustering algorithms are helpful anytime we have a business problem with no documented outcome or solution. Here are several examples. One is defining types of music or movies and gathering similar ones together, or identifying and grouping types of pages on the Internet.
Another is segmenting customers into groups with similar characteristics or buying habits for targeted marketing campaigns. Another is identifying types of social media responses and comments about a company: positive comments, negative comments, angry customers, happy customers, and so on. Another is dimensionality reduction, simplifying a large dataset by grouping similar features into categories.

As always, we encourage you to open RStudio and code along with us. Analytics tools and skills can only be learned by doing. Consider starting with a blank RStudio Notebook and coding everything by hand as you follow along. The more you dig in and get your hands dirty, the quicker you'll master this tool and be able to use it in your daily work. Our goal is to give you a solid foundation for your knowledge so that you can continue to build analytic skills and tools.
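As one possible starting point for coding along, the customer segmentation example mentioned above might be sketched in base R as follows. Everything specific here is an assumption made for illustration: the customer data are simulated, the column names `annual_spend` and `visits_per_month` are hypothetical, and the choice of k-means with two segments is arbitrary rather than the module's method.

```r
# Hypothetical customer data, simulated for illustration only.
set.seed(1)
customers <- data.frame(
  annual_spend     = c(rnorm(50, mean = 200, sd = 30),
                       rnorm(50, mean = 1000, sd = 100)),
  visits_per_month = c(rnorm(50, mean = 2, sd = 0.5),
                       rnorm(50, mean = 8, sd = 1))
)

# Standardize so spend (hundreds of dollars) does not swamp visits (single digits).
scaled <- scale(customers)

# Assumption: two segments. In practice the number of clusters is a judgment call.
segments <- kmeans(scaled, centers = 2, nstart = 25)
customers$segment <- factor(segments$cluster)

# Profile each segment in the original units to judge whether it makes business sense.
aggregate(cbind(annual_spend, visits_per_month) ~ segment,
          data = customers, FUN = mean)
```

There is no accuracy score to check here. The final table of segment averages is what an analyst would inspect to decide whether the groups are meaningful enough to act on, for example by targeting each segment with a different campaign.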