Hyper-parameter Tuning with GridSearchCV in Sklearn

In this tutorial, you’ll learn how to use GridSearchCV for hyper-parameter tuning in machine learning. In machine learning, you train models on a dataset and select the best performing model. One of the tools available to you in your search for the best model is Scikit-Learn’s GridSearchCV class.

By the end of this tutorial, you’ll have learned:

  • Why hyper-parameter tuning is important in building successful machine learning models
  • How GridSearchCV helps you explore the hyper-parameters of your model
  • What the limitations of GridSearchCV are

Hyper-Parameters in Machine Learning

Before we dive into tuning your hyper-parameters, let’s take a moment to recap what the differences between parameters and hyper-parameters are in a machine learning model.

Parameters in a machine learning model are the values that the algorithm itself learns (such as a coefficient) and then uses to produce a prediction. These parameters are not set or hard-coded by you; they depend on the training data that is passed into your model. Because of this, they’re likely to change when your data changes.

On the other hand, hyper-parameters are variables that you specify while building a machine-learning model. This means that it’s the user who defines the hyper-parameters. For example, in a k-nearest neighbour algorithm, the hyper-parameters include the value of k and the type of distance measurement used.

In short, hyper-parameters control the learning process, while parameters are learned.
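
To make the distinction concrete, here is a minimal sketch. It assumes a logistic regression model and Scikit-Learn’s built-in iris dataset purely for illustration (neither is used elsewhere in this tutorial): C is a hyper-parameter you choose, while coef_ and intercept_ are parameters learned during fitting.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Illustrative data only: 150 samples, 4 features, 3 classes
X, y = load_iris(return_X_y=True)

# Hyper-parameters: chosen by you before training
model = LogisticRegression(C=1.0, max_iter=1000)
print(model.get_params()["C"])

# Parameters: learned from the data when .fit() is called
model.fit(X, y)
print(model.coef_.shape)       # one coefficient per class and feature
print(model.intercept_.shape)  # one intercept per class
```

Changing C changes how the model learns; changing the training data changes the values in coef_.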

This is where the “art” of machine-learning comes into play. The choice of your hyper-parameters will have a significant impact on the success of your model. Tuning your model means finding the hyper-parameters that work best.

Hyper-Parameter Tuning in Machine Learning

Hyper-parameter tuning refers to the process of finding the hyper-parameters that yield the best result. This, of course, sounds a lot easier than it actually is. Finding the best hyper-parameters can be an elusive art, especially given that it depends largely on your training and testing data.

As your data evolves, the hyper-parameters that were once high performing may no longer perform well. Keeping track of the success of your model is critical to ensure it keeps pace with the data.

One way to tune your hyper-parameters is to use a grid search. This is probably the simplest method as well as the most crude. In a grid search, you try a grid of hyper-parameters and evaluate the performance of each combination of hyper-parameters.
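
Before handing this over to GridSearchCV, it may help to see a grid search written by hand. The sketch below is an illustration, not how Sklearn implements it: it assumes the built-in iris dataset and a small two-parameter grid, loops over every combination with itertools.product, and scores each one with cross_val_score.

```python
from itertools import product

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Illustrative data only
X, y = load_iris(return_X_y=True)

# The "grid": every value of one hyper-parameter paired with every value of the other
n_neighbors_values = [3, 5, 7]
weights_values = ["uniform", "distance"]

results = []
for n_neighbors, weights in product(n_neighbors_values, weights_values):
    model = KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights)
    # Mean accuracy across 5 cross-validation folds for this combination
    score = cross_val_score(model, X, y, cv=5).mean()
    results.append(((n_neighbors, weights), score))

# Keep the combination with the highest mean score
best_combination, best_score = max(results, key=lambda item: item[1])
print(len(results), "combinations tried")
print("Best:", best_combination, round(best_score, 3))
```

The loop tries 3 * 2 = 6 combinations. Every hyper-parameter you add multiplies that count, which is why a grid search is simple but crude.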

How does Sklearn’s GridSearchCV Work?

The GridSearchCV class in Sklearn serves a dual purpose in tuning your model. The class allows you to:

  1. Apply a grid search to an array of hyper-parameters, and
  2. Cross-validate your model using k-fold cross validation

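This tutorial won’t go into the details of k-fold cross validation. In short, the process splits the available data into k partitions (folds) of roughly equal size. Each fold is held out once as the test set while the model is trained on the remaining folds, and the k scores are averaged to give a more reliable estimate than a single train-test split would.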
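
As a rough illustration of how the folds are formed, the sketch below runs Sklearn’s KFold class on ten toy samples (the toy array is an assumption made for the example). Note that when you pass an integer to cv= for a classifier, GridSearchCV uses a stratified variant of this idea that also keeps the class proportions similar in each fold.

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(10)  # ten toy samples: 0 through 9

kfold = KFold(n_splits=5)
folds = list(kfold.split(data))

# Each sample lands in the test set of exactly one fold
for fold_number, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Fold {fold_number}: train={train_idx}, test={test_idx}")
```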

Let’s explore how the GridSearchCV class works in Sklearn:

# Exploring the GridSearchCV Class
GridSearchCV(
    estimator=,     # A sklearn model
    param_grid=,    # A dictionary of parameter names and values
    cv=,            # An integer that represents the number of k-folds
    scoring=,       # The performance measure (such as r2, precision)
    n_jobs=,        # The number of jobs to run in parallel
    verbose=        # Verbosity (0-3, with higher being more)
)

From the class definition, you can see that the class takes a number of parameters. Let’s explore these in a bit more detail:

  • estimator= takes an estimator object, such as a classifier or a regression model.
  • param_grid= takes a dictionary or a list of dictionaries. Each dictionary holds key-value pairs, where the key is the name of a hyper-parameter and the value is the list of values to test for it.
  • cv= determines the cross-validation strategy to apply. It takes an integer number of folds (or a cross-validation splitter object). If None is passed, then 5-fold cross-validation is used.
  • scoring= takes a string or a callable. This represents the metric used to evaluate each candidate on the held-out fold. If None is passed, the estimator’s own score method is used.
  • n_jobs= represents the number of jobs to run in parallel. Since this is a time-consuming process, running more jobs in parallel (if your computer can handle it) can speed up the process.
  • verbose= determines how much information is displayed. Any value above 0 prints a summary of the search. A value above 1 also displays the computation time for each fold and candidate, a value above 2 adds the score, and a value above 3 adds the fold and candidate parameter indexes.
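
The list-of-dictionaries form of param_grid= is useful when some hyper-parameters only make sense together. The sketch below is an illustration that assumes a support vector classifier and the built-in iris dataset (neither appears elsewhere in this tutorial): gamma is searched only for the RBF kernel, so the two grids contribute 2 and 4 candidates respectively.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative data only
X, y = load_iris(return_X_y=True)

# Two separate grids: gamma is only searched for the RBF kernel
param_grid = [
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["rbf"], "C": [1, 10], "gamma": [0.01, 0.1]},
]

search = GridSearchCV(
    estimator=SVC(),
    param_grid=param_grid,
    cv=3,                # 3-fold cross-validation
    scoring="accuracy",  # a scoring string; a callable also works
)
search.fit(X, y)

print(len(search.cv_results_["params"]))  # 2 + 4 = 6 candidates
print(search.best_params_)
```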

In the next section, we’ll take on an example to see how the GridSearchCV class works in sklearn!

Sklearn GridSearchCV Example

Now that you have a strong understanding of the theory behind Scikit-Learn’s GridSearchCV, let’s explore an example. For this example, we’ll use a K-nearest neighbour classifier and run through a number of hyper-parameters.

Let’s load the penguins dataset that comes bundled into Seaborn:

import pandas as pd
from seaborn import load_dataset

# Load dataset and drop any missing values
df = load_dataset('penguins')
df = df.dropna(how='any')

# Create features and target variables
X = df.drop(columns=['species', 'island', 'sex'])
y = df['species']

In the code above, we imported Pandas and the load_dataset() function from Seaborn. We dropped any records with missing values and split the data into a features DataFrame (X) and a target Series (y). Let’s see what these two variables look like now:

print(X.head())

# Returns:
#    bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g
# 0            39.1           18.7              181.0       3750.0
# 1            39.5           17.4              186.0       3800.0
# 2            40.3           18.0              195.0       3250.0
# 4            36.7           19.3              193.0       3450.0
# 5            39.3           20.6              190.0       3650.0

We can see that we have four columns at our disposal. Similarly, let’s look at what y looks like:

print(y.head())

# Returns:
# 0    Adelie
# 1    Adelie
# 2    Adelie
# 4    Adelie
# 5    Adelie
# Name: species, dtype: object

Now that we have our target and features arrays, we can split the data into training and testing data. For this, we’ll use the train_test_split() function and reserve 20% of the records as testing data.

# Splitting your data into training and testing data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    test_size = 0.2, 
    random_state = 1234
)

From there, we can create a KNN classifier object as well as a GridSearchCV object. For this, we’ll need to import the classes from neighbors and model_selection respectively. We can also define a dictionary of the hyper-parameters we want to evaluate.

A k-nearest neighbour classifier has a number of different hyper-parameters available. In this case, we’ll focus on:

  • n_neighbors, which determines the number of neighbours to look at
  • weights, which determines whether all neighbours count equally ('uniform') or closer neighbours count for more ('distance')
  • p, which determines the type of distance measure to use. For example, 1 would imply the use of the Manhattan distance, while 2 would imply the use of the Euclidean distance.
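
To see what p changes, here is a small worked example. By default, KNeighborsClassifier uses the Minkowski distance, and p is its power parameter. The helper function and the two points below are made up for illustration; they are not part of Sklearn.

```python
import numpy as np

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

def minkowski_distance(u, v, p):
    """Minkowski distance of order p between two points (illustrative helper)."""
    return float(np.sum(np.abs(u - v) ** p) ** (1 / p))

print(minkowski_distance(a, b, p=1))  # Manhattan: |3| + |4| = 7.0
print(minkowski_distance(a, b, p=2))  # Euclidean: sqrt(9 + 16) = 5.0
```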

Let’s create a classifier object, knn, a dictionary of our hyper-parameters, and a GridSearchCV object:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

knn = KNeighborsClassifier()
params = {
    'n_neighbors': [3,5,7,9,11,13],
    'weights': ['uniform', 'distance'],
    'p': [1,2]
}

clf = GridSearchCV(
    estimator=knn,
    param_grid=params,
    cv=5,
    n_jobs=5,
    verbose=1
)

At this point, you’ve created a clf object, which is your GridSearchCV object. So far, we’ve only instantiated it; no models have been trained yet.

Let’s apply the .fit() method to the object, by passing in our training data:

# Fitting our GridSearchCV Object
clf.fit(X_train, y_train)

# Returns:
# Fitting 5 folds for each of 24 candidates, totalling 120 fits
# [Parallel(n_jobs=5)]: Using backend LokyBackend with 5 concurrent workers.
# [Parallel(n_jobs=5)]: Done  50 tasks      | elapsed:    1.9s
# [Parallel(n_jobs=5)]: Done 120 out of 120 | elapsed:    1.9s finished

Because we instructed Sklearn to be verbose, we can see that the search ran 120 fits and that the entire task took 1.9s!

At this point, our object contains a number of really helpful attributes. One of these attributes is the .best_params_ attribute. This attribute holds the combination of hyper-parameters that scored best, given the data and the candidate values you supplied.

# Printing the best parameters
print(clf.best_params_)

# Returns:
# {'n_neighbors': 11, 'p': 1, 'weights': 'distance'}

This indicates that it’s best to use 11 neighbours, the Manhattan distance, and a distance-weighted neighbour search.
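
A fitted GridSearchCV object exposes more than .best_params_. The sketch below shows .best_score_, .best_estimator_ and .cv_results_. To stay self-contained it assumes the built-in iris dataset and a smaller grid rather than the penguins data above, so its numbers will not match the results shown in this tutorial.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data, not the penguins dataset
X, y = load_iris(return_X_y=True)

search = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]},
    cv=5,
)
search.fit(X, y)

# Mean cross-validated score of the winning combination
print(search.best_score_)

# The winning model, refit on all of the data passed to .fit()
print(search.best_estimator_)

# One row per candidate, ordered by rank
results = pd.DataFrame(search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score").head())
```

Looking at the full table is often worthwhile: if several candidates score within a standard deviation of each other, the "best" one may not be meaningfully better.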

Do You Need to Split Data with Sklearn GridSearchCV?

An important topic to consider is whether or not we need to split data into training and testing data when using GridSearchCV. The reason this is a consideration (and not a given) is that the cross validation process itself already splits the data into training and testing folds.

By first splitting our dataset, we’re effectively reducing the data that can be used by GridSearchCV. There are polarized opinions about whether pre-splitting the data is a good idea or not.

In general, tuning on all of your data creates potential for data leakage: the hyper-parameters are chosen using every record, so the cross-validated score can be optimistic. By reserving a percentage of records for a final test of the model, you’re able to get a more representative view of whether or not the tuned model actually performs well on unseen data.
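
One way to follow this advice is sketched below: tune on the training split only, then score the tuned model once on the held-out split. The example assumes the built-in iris dataset and a reduced grid so that it runs on its own; the same pattern applies to the penguins data used earlier.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data
X, y = load_iris(return_X_y=True)

# Hold out 20% of the records; the grid search never sees them
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1234
)

search = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7, 9], "p": [1, 2]},
    cv=5,
)
search.fit(X_train, y_train)

# Score from cross-validation on the training split only
cv_score = search.best_score_
# Score of the refit best model on data it has never seen
test_score = search.score(X_test, y_test)

print(f"Cross-validated accuracy: {cv_score:.3f}")
print(f"Held-out test accuracy:   {test_score:.3f}")
```

If the held-out score is much lower than the cross-validated score, that is a hint the tuned hyper-parameters may be overfitting to the training data.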

Limitations of Sklearn GridSearchCV

At first glance, the GridSearchCV class looks like a miracle. It automates some very mundane tasks and gives you a good sense of what hyper-parameters will work best for your model.

That said, there are a number of limitations for the grid search:

  1. .best_params_ doesn’t show the overall best parameters, but rather the best parameters of the ones you passed in to search.
  2. The process can end up being incredibly time consuming. When we fit the data, we noticed that the method ran through 120 instances of our model! Imagine running through a significantly larger dataset, with more parameters.

The reason that this required 120 runs of the model is that every value of each hyper-parameter is tested in combination with every value of the others. This is then multiplied by the number of cross-validation folds.

In our case, we tested with:

  • 6 values for n_neighbors
  • 2 values for p (the distance measure)
  • 2 values for weights
  • 5 cross-validation folds

This amounts to 6 * 2 * 2 * 5 = 120 tests.
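
You can check this arithmetic before launching a search. The sketch below uses Sklearn’s ParameterGrid class to count the candidates in the grid from the example; the variable names are chosen for illustration.

```python
from sklearn.model_selection import ParameterGrid

params = {
    "n_neighbors": [3, 5, 7, 9, 11, 13],
    "weights": ["uniform", "distance"],
    "p": [1, 2],
}
cv_folds = 5

# Every combination of the listed values is one candidate
n_candidates = len(ParameterGrid(params))
n_fits = n_candidates * cv_folds

print(n_candidates)  # 24
print(n_fits)        # 120
```

With the default refit=True, GridSearchCV also fits the best candidate one more time on all of the data passed to .fit(), on top of the fits counted here.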

Conclusion

The GridSearchCV class in Scikit-Learn is an amazing tool to help you tune your model’s hyper-parameters. In this tutorial, you learned what hyper-parameters are and what the process of tuning them looks like. You then explored sklearn’s GridSearchCV class and its various parameters. Finally, you learned through a hands-on example how to undertake a grid search. You also learned some of the pitfalls of the sklearn GridSearchCV class.
