In this tutorial, you’ll learn how to use GridSearchCV for hyper-parameter tuning in machine learning. In machine learning, you train models on a dataset and select the best performing model. One of the tools available to you in your search for the best model is Scikit-Learn’s GridSearchCV class.
By the end of this tutorial, you’ll have learned:
- Why hyper-parameter tuning is important in building successful machine learning models
- How GridSearchCV is an incredible tool for exploring the hyper-parameters of your model
- What the limitations of GridSearchCV are
Want to learn about a more efficient way to optimize hyperparameters? You can optimize and speed up your hyperparameter tuning using the Optuna library.
Hyper-Parameters in Machine Learning
Before we dive into tuning your hyper-parameters, let’s take a moment to recap what the differences between parameters and hyper-parameters are in a machine learning model.
Parameters in a machine learning model refer to the values that the algorithm itself learns (such as a coefficient) in order to produce a prediction. These parameters are not set or hard-coded and depend on the training data that is passed into your model. Because of this, they’re likely to change when your data changes.
On the other hand, hyper-parameters are variables that you specify while building a machine-learning model. This means that it’s the user who defines the hyper-parameters while building the model. For example, in a k-nearest neighbour algorithm, the hyper-parameters can refer to the value for k or the type of distance measurement used.
In short, hyper-parameters control the learning process, while parameters are learned.
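To make the distinction concrete, here’s a minimal sketch (using made-up toy data rather than the dataset from this tutorial) that contrasts a hyper-parameter you set yourself with a parameter the algorithm learns:
# Contrasting a hyper-parameter (set by you) with a parameter (learned from data)
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
# Hyper-parameters: values you choose before training
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
# Parameters: values the algorithm produces during training
X = [[1], [2], [3], [4]]
y = [2, 4, 6, 8]
reg = LinearRegression().fit(X, y)
print(reg.coef_, reg.intercept_)   # learned coefficient and intercept
# Returns (approximately): [2.] 0.0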
This is where the “art” of machine learning comes into play. The choice of your hyper-parameters will have a significant impact on the success of your model. Tuning your model comes down to finding the hyper-parameters that work best.
Hyper-Parameter Tuning in Machine Learning
Hyper-parameter tuning refers to the process of finding the hyper-parameters that yield the best result. This, of course, sounds a lot easier than it actually is. Finding the best hyper-parameters can be an elusive art, especially given that it depends largely on your training and testing data.
As your data evolves, the hyper-parameters that were once high performing may no longer perform well. Keeping track of the success of your model is critical to ensure it grows with the data.
One way to tune your hyper-parameters is to use a grid search. This is probably the simplest method as well as the most crude. In a grid search, you try a grid of hyper-parameters and evaluate the performance of each combination of hyper-parameters.
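Before handing this job to Scikit-Learn, it can help to see roughly what a grid search looks like written by hand. The sketch below is purely illustrative: the hyper-parameter values and the validation split are placeholders you’d adapt to your own data.
# A hand-rolled grid search: try every combination and keep the best one
from itertools import product
from sklearn.neighbors import KNeighborsClassifier
def manual_grid_search(X_train, y_train, X_val, y_val):
    best_score, best_params = -1, None
    for n, weights in product([3, 5, 7], ['uniform', 'distance']):
        model = KNeighborsClassifier(n_neighbors=n, weights=weights)
        model.fit(X_train, y_train)
        score = model.score(X_val, y_val)  # accuracy on the validation data
        if score > best_score:
            best_score = score
            best_params = {'n_neighbors': n, 'weights': weights}
    return best_params, best_score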
How does Sklearn’s GridSearchCV Work?
The GridSearchCV class in Sklearn serves a dual purpose in tuning your model. The class allows you to:
- Apply a grid search to an array of hyper-parameters, and
- Cross-validate your model using k-fold cross validation
This tutorial won’t go into the details of k-fold cross validation. In short, the process repeatedly partitions the available data into training and validation folds, so that every record is used for evaluation at some point and you get a more reliable estimate of your model’s performance.
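If you’d like a quick feel for cross validation on its own, Scikit-Learn’s cross_val_score() function runs it for you. The sketch below uses the bundled iris dataset purely for illustration:
# Running 5-fold cross validation on its own
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)
print(scores)         # one accuracy score per fold
print(scores.mean())  # the average score across the five folds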
Let’s explore how the GridSearchCV class works in Sklearn:
# Exploring the GridSearchCV Class
GridSearchCV(
estimator=, # A sklearn model
param_grid=, # A dictionary of parameter names and values
cv=, # An integer that represents the number of k-folds
scoring=, # The performance measure (such as r2, precision)
n_jobs=, # The number of jobs to run in parallel
verbose= # Verbosity (0-3, with higher being more)
)
From the class definition, you can see that the class takes a number of parameters. Let’s explore these in a bit more detail:
- estimator= takes an estimator object, such as a classifier or a regression model.
- param_grid= takes a dictionary or a list of dictionaries. The dictionaries should be key-value pairs, where the key is the hyper-parameter name and the value is the list of values to test for that hyper-parameter.
- cv= takes an integer that determines the cross-validation strategy to apply. If None is passed, then 5-fold cross validation is used.
- scoring= takes a string or a callable. This represents the strategy used to evaluate performance on the held-out folds (the sketch after this list shows how to see the built-in scoring strings).
- n_jobs= represents the number of jobs to run in parallel. Since this is a time-consuming process, running more jobs in parallel (if your computer can handle it) can speed up the process.
- verbose= determines how much information is displayed. A value of 1 displays the time for each run, 2 also displays the score, and 3 also displays the fold and candidate parameters.
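If you’re not sure which strings scoring= accepts, you can list the built-in options. Note that get_scorer_names() is available in Scikit-Learn 1.0 and later; older versions expose the same information differently.
# Listing the built-in scoring strings (Scikit-Learn 1.0+)
from sklearn.metrics import get_scorer_names
print(sorted(get_scorer_names()))
# Includes strings such as 'accuracy', 'f1', 'precision', 'r2', ...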
In the next section, we’ll take on an example to see how the GridSearchCV class works in sklearn!
Sklearn GridSearchCV Example
Now that you have a strong understanding of the theory behind Scikit-Learn’s GridSearchCV, let’s explore an example. For this example, we’ll use a K-nearest neighbour classifier and run through a number of hyper-parameters.
Let’s load the penguins dataset that comes bundled into Seaborn:
import pandas as pd
from seaborn import load_dataset
# Load dataset and drop any missing values
df = load_dataset('penguins')
df = df.dropna(how='any')
# Create features and target variables
X = df.drop(columns=['species', 'island', 'sex'])
y = df['species']
In the code above, we imported Pandas and the load_dataset() function from Seaborn. We dropped any missing records and split the data into a features array (X) and a target Series (y). Let’s see what these two variables look like now:
print(X.head())
# Returns:
# bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
# 0 39.1 18.7 181.0 3750.0
# 1 39.5 17.4 186.0 3800.0
# 2 40.3 18.0 195.0 3250.0
# 4 36.7 19.3 193.0 3450.0
# 5 39.3 20.6 190.0 3650.0
We can see that we have four columns at our disposal. Similarly, let’s look at what y looks like:
print(y.head())
# Returns:
# 0 Adelie
# 1 Adelie
# 2 Adelie
# 4 Adelie
# 5 Adelie
# Name: species, dtype: object
Now that we have our target and features arrays, we can split the data into training and testing data. For this, we’ll use the train_test_split() function and reserve 20% of the data for testing.
# Splitting your data into training and testing data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size = 0.2,
random_state = 1234
)
From there, we can create a KNN classifier object as well as a GridSearchCV object. For this, we’ll need to import the classes from neighbors and model_selection respectively. We can also define a dictionary of the hyper-parameters we want to evaluate.
A k-nearest neighbour classifier has a number of different hyper-parameters available (the sketch after this list shows how to see all of them). In this case, we’ll focus on:
- n_neighbors, which determines the number of neighbours to look at
- weights, which determines whether to weight each neighbour by its distance
- p, which determines the type of distance measure to use. For example, 1 would imply the use of the Manhattan distance, while 2 would imply the use of the Euclidean distance.
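As a quick aside, every Scikit-Learn estimator has a .get_params() method that returns all of its hyper-parameters and their current (default) values, which is handy when deciding what to put into your grid:
# Inspecting every hyper-parameter a KNN classifier exposes
from sklearn.neighbors import KNeighborsClassifier
print(KNeighborsClassifier().get_params())
# Returns a dictionary including 'n_neighbors', 'weights', 'p', 'metric', ...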
Let’s create a classifier object, knn, a dictionary of our hyper-parameters, and a GridSearchCV object:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
knn = KNeighborsClassifier()
params = {
'n_neighbors': [3,5,7,9,11,13],
'weights': ['uniform', 'distance'],
'p': [1,2]
}
clf = GridSearchCV(
estimator=knn,
param_grid=params,
cv=5,
n_jobs=5,
verbose=1
)
At this point, you’ve created a clf object, which is your GridSearchCV object. So far, we’ve really just instantiated the object; we haven’t done anything with it in particular.
Let’s apply the .fit() method to the object, by passing in our training data:
# Fitting our GridSearchCV Object
clf.fit(X_train, y_train)
# Returns:
# Fitting 5 folds for each of 24 candidates, totalling 120 fits
# [Parallel(n_jobs=5)]: Using backend LokyBackend with 5 concurrent workers.
# [Parallel(n_jobs=5)]: Done 50 tasks | elapsed: 1.9s
# [Parallel(n_jobs=5)]: Done 120 out of 120 | elapsed: 1.9s finished
We can see that, because we instructed Sklearn to be verbose, our entire search took 1.9 seconds and ran 120 fits!
At this point, our object contains a number of really helpful attributes. One of these is the .best_params_ attribute, which returns the best-performing combination out of the hyper-parameter values you asked it to search, for the given data.
# Printing the best parameters
print(clf.best_params_)
# Returns:
# {'n_neighbors': 11, 'p': 1, 'weights': 'distance'}
This indicates that it’s best to use 11 neighbours, the Manhattan distance, and a distance-weighted neighbour search.
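The fitted object exposes a few other helpful attributes as well. The exact numbers you see will depend on your own run, so treat any output as illustrative:
# Exploring other attributes of the fitted GridSearchCV object
print(clf.best_score_)       # mean cross-validated score of the best combination
print(clf.best_estimator_)   # the KNN model refit with the best hyper-parameters
# .cv_results_ holds the full search history and loads neatly into a DataFrame
results = pd.DataFrame(clf.cv_results_)
print(results[['params', 'mean_test_score', 'rank_test_score']].head())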
Do You Need to Split Data with Sklearn GridSearchCV?
An important topic to consider is whether or not we need to split data into training and testing data when using GridSearchCV. The reason this is a consideration (and not a given), is that the cross validation process itself splits the data into training and testing data.
By first splitting our dataset, we’re effectively reducing the data that can be used by GridSearchCV. There are polarized opinions about whether pre-splitting the data is a good idea or not.
In general, there is potential for data leakage into the hyper-parameters if you don’t first split your data. By reserving a percentage of records for the true test of the model, you’re able to get a more representative view of whether or not the model actually performs effectively.
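If you do hold out a test set, the fitted GridSearchCV object can be scored on it directly, because (with the default refit=True) it refits the best model on all of the training data. The exact accuracy you see will depend on your split:
# Evaluating the refit best model on the held-out test set
print(clf.score(X_test, y_test))   # accuracy on data the search never saw
# Equivalently, generate predictions with the best estimator
predictions = clf.predict(X_test)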
Limitations of Sklearn GridSearchCV
At first glance, the GridSearchCV class looks like a miracle. It automates some very mundane tasks and gives you a good sense of what hyper-parameters will work best for your model.
That said, there are a number of limitations for the grid search:
- .best_params_ doesn’t show the overall best parameters, but rather the best of the combinations you passed in to search.
- The process can end up being incredibly time consuming. When we fit the data, we noticed that the method ran through 120 instances of our model! Imagine running through a significantly larger dataset, with more parameters.
The reason that this required 120 runs of the model is that every combination of hyper-parameter values is tested. This is then multiplied by the number of cross-validation folds.
In our case, we tested with:
- 6 values for n_neighbors
- 2 values for p (the distance measure)
- 2 values for weights
- 5 cross-validation folds
This amounts to 6 * 2 * 2 * 5 = 120 fits.
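You can verify this arithmetic programmatically with Scikit-Learn’s ParameterGrid class, which expands a parameter dictionary into every combination. The sketch below reuses the params dictionary defined earlier:
# Counting the number of fits: combinations multiplied by folds
from sklearn.model_selection import ParameterGrid
n_combinations = len(ParameterGrid(params))
print(n_combinations)        # Returns: 24 (6 * 2 * 2)
print(n_combinations * 5)    # Returns: 120 fits with 5-fold cross validation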
Conclusion
The GridSearchCV class in Scikit-Learn is an amazing tool to help you tune your model’s hyper-parameters. In this tutorial, you learned what hyper-parameters are and what the process of tuning them looks like. You then explored sklearn’s GridSearchCV class and its various parameters. Finally, you learned through a hands-on example how to undertake a grid search. You also learned some of the pitfalls of the sklearn GridSearchCV class.