Machine Learning - Unsupervised
So far, you have seen how to make a machine learn to solve a problem from labeled examples. Regression trains a machine to predict a future value. Classification trains a machine to assign an unknown object to one of our predefined categories. In both cases, we train the machine to predict Y from our data X. However, supervised learning becomes difficult when the data set is very large and the categories are not known in advance. Suppose instead that the machine could analyze data running into several gigabytes or terabytes and tell us how many distinct kinds of information it contains.
For example, consider voter data. With supervised learning, the machine can predict that a certain number of voters will vote for political party A and a certain number will vote for party B, based on some inputs from voters (called features in AI terminology). With unsupervised learning, we are instead asking the machine, given a huge set of data points X, "What can you tell me about X?" or "What are five groups we can make from X?". The question could even be as simple as "What three features occur together most frequently in X?".
Unsupervised learning is exactly what it sounds like: the machine looks for structure in the data on its own, without labeled answers to supervise it.
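As a rough illustration of the last question, the sketch below counts which three features appear together most often. The records and feature names are hypothetical toy data invented for this example, and brute-force counting like this is only practical for small feature sets.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy records: each set lists the features present for one voter.
records = [
    {"urban", "renter", "student", "cyclist"},
    {"urban", "renter", "student"},
    {"rural", "homeowner", "retired"},
    {"urban", "renter", "student", "retired"},
    {"rural", "homeowner", "cyclist"},
]

def most_common_triple(rows):
    """Return the feature triple that co-occurs in the most rows, with its count."""
    counts = Counter()
    for row in rows:
        # Sort first so each triple has a single canonical ordering.
        for triple in combinations(sorted(row), 3):
            counts[triple] += 1
    return counts.most_common(1)[0]

triple, count = most_common_triple(records)
print(triple, count)
```

Note that no record carries a label here; the answer comes only from the structure of the data itself.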
Algorithms for Unsupervised Learning
We will now discuss one of the most widely used clustering algorithms in unsupervised machine learning.
k-means clustering
The 2000 and 2004 presidential elections in the United States were very close. The largest percentage of the popular vote that any candidate received was 50.7% and the lowest was 47.9%. If a small percentage of the voters had switched sides, the outcome of the election would have been different. Some groups of voters will switch sides when properly appealed to, and when elections are close, these groups may be large enough to change the outcome. How do you find these groups of people? How do you appeal to them with a limited budget?
Here's how it's done.
- The first step is to collect information on people, either with or without their consent: any sort of information that might give a clue as to what is important to them and what will influence their voting behavior.
- After that, you put this information into a clustering algorithm.
- After that, you craft a message that will appeal to each cluster (you should start with the largest one).
- Lastly, you deliver the campaign and measure its effectiveness.
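The advice to start with the largest cluster can be sketched in a few lines. The label array below is hypothetical: it stands in for whatever a clustering algorithm would return, one cluster id per person.

```python
import numpy as np

# Hypothetical cluster ids, one per person, as a clustering algorithm might return.
cluster_labels = np.array([2, 0, 1, 0, 0, 2, 0, 1, 2, 0])

# Count the members of each cluster, then order cluster ids from largest to smallest.
sizes = np.bincount(cluster_labels)
order = np.argsort(sizes)[::-1]

for cluster_id in order:
    print(f"cluster {cluster_id}: {sizes[cluster_id]} people")
```

With a limited budget, the messages would then be crafted in the order given by `order`.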
Clustering is a type of unsupervised learning that automatically forms groups of similar things. It works like automatic classification. In this chapter, we will look at a clustering algorithm called k-means. It can cluster almost anything, and the more similar the items within each cluster are, the better the clustering. The algorithm is called k-means because it finds k distinct clusters, and the center of each cluster is the mean of all values in that cluster.
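To make the idea concrete, here is a minimal sketch of k-means in NumPy, using the standard alternation between assigning points to the nearest center and moving each center to the mean of its points. The function name `kmeans`, its parameters, and the toy points are assumptions made for this illustration; production code would usually rely on a tested library implementation instead.

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """Plain k-means sketch: returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random as the initial centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins the cluster of its nearest center.
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each center moves to the mean of its members
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving, so the clusters are stable
        centers = new_centers
    return centers, labels

# Hypothetical toy data: two visibly separate groups of 2-D points.
points = np.array([
    [1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
    [8.0, 8.0], [9.0, 8.5], [8.5, 9.0],
])
centers, labels = kmeans(points, k=2)
print(centers)
print(labels)
```

On this toy data the first three points end up in one cluster and the last three in the other. Which cluster receives id 0 depends on the random starting centers, so the ids themselves carry no meaning.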
Cluster Identification
Cluster identification tells an algorithm, "Here are some data. Now group similar things together and tell me about those groups." The key difference from classification is that in classification you know what you are looking for, while in clustering you don't.
Clustering is sometimes called unsupervised classification because it produces the same kind of result as classification, a group assignment for each item, but without predefined classes.
Now that we're familiar with both supervised and unsupervised learning, we will learn about Artificial Neural Networks (ANN) in the next chapter.