In this tutorial, you’ll gain an understanding of what machine learning is and how Python can help you take on machine learning projects. Once you understand what machine learning is, you’ll start to see just how pervasive it is.
In many cases, people see machine learning as applications developed by Google, Facebook, or Twitter. Many of these applications are complex and are often made up of many smaller models. But don’t let that scare you! This article aims to show you how approachable it is to build your own machine learning models. Many applications of machine learning can be readily applied to your everyday work.
By the end of this tutorial, you’ll have a strong understanding of:
- What machine learning is (and what it isn’t)
- How supervised and unsupervised machine learning algorithms work
- What classification and regression machine learning models do
- What clustering and dimension reduction machine learning models do
- The general machine learning process for a small or large project
What is Machine Learning?
Machine learning is the process of building mathematical models to help us better understand data. The term “learning” is used because machine learning models are given tunable parameters that allow them to adapt to data. Once the model has learned from previous data, it can be used to make predictions or to better understand new, unobserved data.
Before we dive further into this, let’s take a look at how “intelligent” applications were built in the past. Say you were building a computer program to detect spam messages in your email. You might notice that a lot of spam messages contain a certain phrase, such as “wire transfer”. You could build a rule using an if-else statement to classify these messages as spam.
There are a lot of problems with this approach. First and foremost, not every message with the words “wire transfer” will actually be spam! Secondly, there are many other spam-related keywords out there. As spammers learn which keywords to avoid, you’ll need to keep adding to this list.
# Creating an inefficient and inaccurate spam filter:
def filter_messages(email):
    # Each rule is a hard-coded keyword check
    if 'wire transfer' in email:
        return 'Spam!!'
    elif 'local singles' in email:
        return 'Spam!!'
    # ... so many options
    else:
        return 'Probably not spam'
Modern-day spam filters use statistical and algorithmic models to predict whether an email should be classified as spam or not. Email services often let you mark an email as spam or not spam. Each label you apply becomes new data that the algorithms can learn from, without being explicitly programmed!
For example, a modern-day spam filter will take into account how long an email is, the frequency with which certain words appear, how emails have been labelled in the past, etc. As you label an email as spam (or not spam), the algorithms can “learn” from this (or be trained) and be more accurate in the future!
Machine learning models take data, turn it into vectors (a concept you’ll learn about later), pass these vectors into an algorithm, and return a predicted label! For example, an email’s text can be turned into certain properties (like length, frequency of words, presence of words, etc.) that are fed into the algorithm. The algorithm can then return a predicted label, such as spam or not spam!
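To make the idea of turning an email into numbers more concrete, here is a minimal sketch. The function name `email_to_features`, the keyword list, and the sample email are all made up for illustration; real spam filters use far richer features.

```python
def email_to_features(text, keywords):
    """Turn raw email text into a list of numbers (a feature vector)."""
    lowered = text.lower()
    # First feature: the length of the email, measured in words
    features = [len(lowered.split())]
    # Remaining features: how often each keyword appears as a substring
    for keyword in keywords:
        features.append(lowered.count(keyword))
    return features


# Hypothetical keywords and email, chosen only for this example
KEYWORDS = ["wire transfer", "local singles", "free"]
email = "Claim your FREE prize now: send a wire transfer today"

vector = email_to_features(email, KEYWORDS)
print(vector)  # [10, 1, 0, 1]
```

A vector like this is what would be passed into an algorithm, which would then return a predicted label.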
Categories of Machine Learning
At its most basic level, you can break machine learning down into two primary types: supervised learning and unsupervised learning.
Supervised learning refers to the process of modelling data based on the relationship between features of the data and some “label” associated with each data point. Once this model has been created, it can take new data and make predictions about what that data represents.
Take the first image above as an example. There are two main inputs: training data and their associated labels. For example, a stack of emails and whether each one is spam or not spam. These are turned into “vectors”, or mathematical representations of the data. An algorithm then uses these representations to build a model.
When new data arrives, it can be represented numerically in the same way. Since your model has been tuned to work with this kind of data, the new data can be passed in to return a predicted label. Following the example of spam emails: you receive a new email and the model can determine whether or not it’s spam!
Supervised learning is often broken into two main domains:
- Classification, where labels fall into discrete categories
- Regression, where labels are continuous
Meanwhile, unsupervised learning refers to the process of modelling features of a dataset without any provided labels. The reason this is “unsupervised” is that there is no predetermined output you’re providing to the model. For example, you’re not hoping to label emails as spam or not spam, but rather letting the machine learning model determine the differences between data points.
Supervised Learning in Python
In this section, you’ll learn a bit more about the two primary domains of supervised machine learning: classification and regression. Classification generally refers to returning discrete categories of data, while regression refers to the process of returning some predicted continuous value.
Classification Machine Learning: Predicting Discrete Categories
In a classification machine learning problem, you are given a set of labelled points and use this data to classify a set of unlabelled points. Take a look at the sample data below. You’re given a set of two-dimensional data. The data are pre-labelled into two categories (in this case, represented by color).
Given a new point of data, you can plot that point on that graph. Visually, you may be able to tell which group that point should belong to. But how can a computer solve that problem for you? This is a classification problem.
A Quick Overview of Training and Testing Models
You use the training data (e.g., the set above) to train the machine learning model. How you train the model is up to both you and the dataset! Finding the best model to use is part of the art of machine learning. For example, you could draw a line between the two categories of data. Anything below the line is assigned one color, while anything above the line is assigned the other color.
This process is referred to as “training” the model. You build a model against a sample of your data (the “training” data). For example, you take a sample of, say, 60% of the data to train the model. You can then use the remaining 40% of the data to “test” the model. Since your data is pre-labelled, you can calculate a specific metric of the model’s accuracy. From there, you can attempt to build other models and see if others are more effective.
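The split, train, and test steps can be sketched in plain Python. This is only an illustration under stated assumptions: the points and colors are made up, and the "model" is a simple nearest-centroid rule (averaging each color's training points and picking the closer average), which amounts to drawing a straight line between the two groups. All function names here are hypothetical helpers, not part of any library.

```python
import math
import random


def train_test_split(rows, train_fraction=0.6, seed=0):
    """Shuffle the labelled rows and split them into training and testing sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def train_centroids(train_rows):
    """'Train' by averaging the points of each label into one centroid per label."""
    grouped = {}
    for point, label in train_rows:
        grouped.setdefault(label, []).append(point)
    return {
        label: tuple(sum(axis) / len(points) for axis in zip(*points))
        for label, points in grouped.items()
    }


def predict(centroids, point):
    """Predict the label whose centroid is closest to the point."""
    return min(centroids, key=lambda label: math.dist(point, centroids[label]))


def accuracy(centroids, rows):
    """Share of rows whose predicted label matches the known label."""
    correct = sum(1 for point, label in rows if predict(centroids, point) == label)
    return correct / len(rows)


# Made-up, pre-labelled two-dimensional points (two well-separated groups)
data = [
    ((1, 1), "red"), ((1.5, 2), "red"), ((2, 1), "red"),
    ((1, 2.5), "red"), ((2.5, 2), "red"),
    ((5, 5), "blue"), ((6, 5.5), "blue"), ((5.5, 6.5), "blue"),
    ((6.5, 6), "blue"), ((5, 6.5), "blue"),
]

train_rows, test_rows = train_test_split(data)   # 60% train, 40% test
centroids = train_centroids(train_rows)          # training step
test_accuracy = accuracy(centroids, test_rows)   # testing step
print(len(train_rows), len(test_rows), test_accuracy)
```

Because the two made-up groups are far apart, this toy model labels every test point correctly; real data is rarely this tidy, which is why the accuracy score matters.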
For example, visually you can see that there are certain clusters of data. This can mean that you can measure the distance from a number of pre-labelled points. If a point is closer to a number of points from one color (i.e., it has more neighbours of one color), then the data point can be labelled to be that color.
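This neighbour-counting idea is commonly known as k-nearest neighbours. A minimal sketch, assuming made-up points and a hypothetical helper named `knn_predict`, might look like this:

```python
import math
from collections import Counter


def knn_predict(train_rows, point, k=3):
    """Label a point by majority vote among its k closest labelled points."""
    by_distance = sorted(train_rows, key=lambda row: math.dist(row[0], point))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up, pre-labelled two-dimensional points
labelled_points = [
    ((1, 1), "red"), ((1.5, 2), "red"), ((2, 1), "red"), ((2.5, 2), "red"),
    ((5, 5), "blue"), ((6, 5.5), "blue"), ((5.5, 6.5), "blue"), ((6.5, 6), "blue"),
]

print(knn_predict(labelled_points, (2, 2)))      # red
print(knn_predict(labelled_points, (5.5, 5.5)))  # blue
```

The value of k is a choice you make: a small k reacts to individual points, while a larger k smooths the decision over more neighbours.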
Regression Machine Learning: Predicting Continuous Data
In a regression problem, the labels of your data are continuous variables. Linear regression models are quite popular because they can be easily and quickly fit to your data, and they’re very easy to understand. In their simplest form, linear regression models fit a line of best fit to two variables: a dependent and an independent variable. However, these models can also be extended to much more complicated data behavior.
In those cases the independent variable can be used to predict the value of the dependent variable, using a simple equation.
In this case, you could likely find a straight line to model this relationship. The function of the line would take the form of:
y = mx + b
In this case, m is referred to as the slope of the graph and b is referred to as the intercept of the function.
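As a rough sketch, the slope m and intercept b can be estimated with the standard least-squares formulas. The helper name `fit_line` and the data are made up for illustration; the points were chosen to lie exactly on y = 2x + 1 so the result is easy to check.

```python
def fit_line(xs, ys):
    """Estimate slope m and intercept b of y = mx + b using least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies
    m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    # Intercept: forces the line through the point of means
    b = mean_y - m * mean_x
    return m, b


# Made-up data lying exactly on y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

m, b = fit_line(xs, ys)
prediction = m * 6 + b  # predict y for a new x value of 6
print(m, b, prediction)  # 2.0 1.0 13.0
```

With real, noisy data the points won't sit exactly on the line, and the fitted line is the one that minimizes the squared vertical distances to the points.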
Similarly, these types of regression can extend to more than two dimensions. In the example below, there are three dimensions and the regression model forms a plane rather than a line.
In this case, you fit a plane to estimate what value will be returned given some inputs. Of course, not all relationships can be captured by a linear model. This is where these models become more complex. Similarly, two or three variables are rarely enough to predict the behavior of a phenomenon.
The important takeaway from this section is that regression is a form of supervised learning. You train a model using training data and you can evaluate its effectiveness using testing data. Once you are comfortable with the performance of your model, you can pass in a data point with an unknown label (such as the selling price of a house, the income a person has, someone’s weight, etc.) and predict that label from the combination of its other variables.
Unsupervised Learning in Python
Supervised learning involves using data with known labels to predict the label of new data. On the other hand, unsupervised learning involves building models that describe some data without any reference to known labels. In the example above, where you classified emails as spam or not spam, you fed a label into the model that told the algorithm: “this is spam” or “this isn’t spam”.
In unsupervised learning, these labels aren’t applied. Instead, the models you build identify patterns in the data on their own!
Clustering Machine Learning: Labelling Unlabeled Data
One of the most common use cases for unsupervised learning is the concept of “clustering”. In this type of machine learning, data are automatically assigned to different discrete groups. Take a look at the example below:
In the first image, it’s easy to see that there are three clusters of data. Unsupervised learning algorithms can find these groups and apply the labels for you. In many cases, there’ll be more than two features that feed into a clustering analysis and the results may not be as easy to tell apart.
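One of these unsupervised clustering methods is the k-means algorithm. The algorithm places k cluster centers among the data, assigns each point to its nearest center, and then moves each center to the middle of its assigned points, repeating until the groups settle. By looking at proximity in space, i.e., looking at how closely related these data points are, the algorithm can attempt to cluster them. (Despite the similar name, k-nearest neighbors is a supervised classification method that needs labelled data.)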
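Clustering by proximity can be sketched with k-means, a widely used clustering algorithm. This is a simplified illustration, not a production implementation: the points are made up, the helper names are hypothetical, and the starting centers are naively taken to be the first k points (real implementations choose them more carefully).

```python
import math


def nearest_center(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def k_means(points, k, iterations=10):
    """Group unlabelled points into k clusters based on proximity."""
    centers = list(points[:k])  # naive initialisation (an assumption of this sketch)
    for _ in range(iterations):
        # Step 1: assign every point to its nearest center
        assignments = [nearest_center(p, centers) for p in points]
        # Step 2: move each center to the mean of its assigned points
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                centers[i] = tuple(sum(axis) / len(members) for axis in zip(*members))
    # Final assignment against the settled centers
    assignments = [nearest_center(p, centers) for p in points]
    return centers, assignments


# Made-up, unlabelled points that visually form three groups
points = [
    (1, 1), (5, 5), (9, 1),
    (1.5, 2), (6, 5.5), (8.5, 1.5),
    (2, 1), (5.5, 6), (9.5, 2),
]

centers, assignments = k_means(points, k=3)
print(assignments)  # [0, 1, 2, 0, 1, 2, 0, 1, 2]
```

Note that no labels were provided: the cluster numbers 0, 1, and 2 are simply the groups the algorithm found on its own.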
Dimensionality Reduction: Finding the Best Features
Another common use case for unsupervised learning is the process of dimensionality reduction. In the examples you’ve looked at above, there have been a limited number of features. However, in practice, you’ll often encounter datasets with hundreds, if not thousands, of features.
Not only do more features mean longer runtimes, but they can also mean that your model gets “overfit” to the training data. This means that the model performs incredibly well on the data that you have, but not as well on new data.
Dimensionality reduction takes a dataset with many features and aims to reduce the features (or the dimensions) of the data while retaining the integrity of the dataset and the model as much as possible.
One such machine learning process is known as principal component analysis. This type of algorithm is used to reduce redundancies in the data through feature extraction. What actually happens is that a large set of variables is transformed into a smaller set of variables that maintains most of the information from the larger dataset.
This, of course, comes at the expense of some accuracy. However, the beauty of principal component analysis is that it trades a little accuracy for significant simplicity. This allows machine learning models to learn much faster. Similarly, the models are able to make predictions much faster as well.
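To give a feel for this trade-off, here is a small sketch of principal component analysis on just two features, reduced to one. It assumes made-up data and a hypothetical helper `pca_2d_to_1d`, and it relies on the closed-form direction of greatest variance that exists for two-dimensional data; with many features you would typically reach for a library instead.

```python
import math


def pca_2d_to_1d(points):
    """Project 2D points onto their direction of greatest variance."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 covariance matrix
    var_x = sum(x * x for x, _ in centered) / (n - 1)
    var_y = sum(y * y for _, y in centered) / (n - 1)
    cov_xy = sum(x * y for x, y in centered) / (n - 1)

    # Angle of the first principal axis of a symmetric 2x2 matrix
    angle = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)
    direction = (math.cos(angle), math.sin(angle))

    # Each point becomes a single number: its position along that axis
    projected = [x * direction[0] + y * direction[1] for x, y in centered]
    retained = (sum(p * p for p in projected) / (n - 1)) / (var_x + var_y)
    return direction, projected, retained


# Made-up data where the two features are strongly related (redundant)
points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8)]

direction, projected, retained = pca_2d_to_1d(points)
print(direction, projected, retained)
```

Here two features are replaced by one, and `retained` reports the share of the original variance that survives. For this made-up data it is over 99%, which is the "little accuracy for significant simplicity" trade described above.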
The Machine Learning Process
In this section, you’ll gain an understanding of the machine learning process. The process is deliberately simplified and abstracts many of the complexities of the actual work. That being said, much of the process is true regardless of the machine learning project you take on, whether large or small!
- Getting Data: The first step is to get data. This can take many different forms, such as getting historical data from open data sets, using your company’s data, or using real-time data from IoT systems.
- Cleaning and Preparing the Data: This can be a fairly involved step. It’s often said that up to 80% of a data scientist’s time is spent on the pre-processing of data. At the end of this step, the data are split into training and test datasets.
- Training Your Model: Finally, you get to build your model! This step involves picking the right type of machine learning model (such as classification or regression) and using the training data to develop a model.
- Testing Your Model: In this step, you test your model’s effectiveness against data that it hasn’t yet seen. Using the testing data, you can calculate a score of the effectiveness of your model.
- Improving Your Model: Now that you know how well your model performs, you can make adjustments. These adjustments can mean fine-tuning your model’s parameters, reducing the number of features (such as through principal component analysis), or choosing a different type of model altogether.
- Using Your Model: If you’re satisfied with your model, it can be deployed to production! This means using the model in the real world. This is where the effectiveness of your model really comes to light. You may find that you need to tweak certain elements, improve its speed, or broaden its domains.
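The steps above can be sketched end to end on a tiny example. Everything here is an assumption made for illustration: the house sizes and prices are invented (and deliberately lie on a straight line), the helper names are hypothetical, and the split simply takes the first rows for training rather than shuffling.

```python
def clean(rows):
    """Cleaning: drop rows with missing values."""
    return [(size, price) for size, price in rows if size is not None and price is not None]


def fit_line(rows):
    """Training: least-squares fit of price = m * size + b."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    m = sum((x - mean_x) * (y - mean_y) for x, y in rows) / sum(
        (x - mean_x) ** 2 for x, _ in rows
    )
    return m, mean_y - m * mean_x


def mean_absolute_error(predict, rows):
    """Testing: average size of the prediction errors."""
    return sum(abs(predict(x) - y) for x, y in rows) / len(rows)


# Getting data: made-up (size in square metres, price in thousands) pairs
raw = [(50, 150), (60, 180), (None, 200), (80, 240),
       (100, 300), (120, None), (70, 210), (90, 270)]

rows = clean(raw)
train_rows, test_rows = rows[:4], rows[4:]  # simple split, no shuffling

# Training two candidate models: a fitted line and a "predict the average" baseline
m, b = fit_line(train_rows)
average_price = sum(price for _, price in train_rows) / len(train_rows)

# Testing both on data they haven't seen
line_error = mean_absolute_error(lambda size: m * size + b, test_rows)
baseline_error = mean_absolute_error(lambda size: average_price, test_rows)

# Improving: keep whichever model scores better on the test data
best = "line" if line_error < baseline_error else "baseline"
print(line_error, baseline_error, best)
```

Comparing candidate models on held-out test data like this is one simple way to decide which model is worth using.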
Conclusion and Recap
In this post, you learned the basic theory of machine learning and how it can be used in Python. The section below provides a quick recap of everything you learned:
- Machine learning is the process of building models to better understand data
- Machine learning isn’t building a hard-coded set of rules for a computer to follow
- Supervised learning refers to building a model based on the relationship between features of data and some label associated with that data
- Such models take the form of classification or regression
- Unsupervised learning, on the other hand, refers to the process of using models to describe data without any reference to known labels
- These models can allow you to cluster data in meaningful ways or help in reducing the dimensionality of datasets
Additional Resources
To learn more about related topics, check out the tutorials below: