Pandas get dummies (One-Hot Encoding) Explained

The Pandas get_dummies() function (pd.get_dummies()) allows you to easily one-hot encode your categorical data.

In this tutorial, you’ll learn what one-hot encoding is, what some potential drawbacks of one-hot encoding are, and how to one-hot encode with Pandas, including how to customize the output.

What is one-hot encoding?

One-hot encoding is an important step for preparing your dataset for use in machine learning. It turns your categorical data into a binary vector representation, and the Pandas get_dummies() function makes this very easy!

In practice, this means that for each unique value in a column, a new column is created. The values in each new column are 1s and 0s, depending on whether the row’s value matches the column header.

See the image below for a visual representation of what happens:

This can be really helpful for machine learning techniques that require binary and numerical representations of data.
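
To make the idea concrete, here is a minimal sketch that builds the binary columns by hand, without get_dummies(). The Color column is made up for illustration and is not part of this tutorial’s dataset.

```python
import pandas as pd

# Hypothetical example column (not from the tutorial's dataset)
colors = pd.Series(['Red', 'Green', 'Red', 'Blue'], name='Color')

# Build one new column per unique value: 1 where the row matches, 0 otherwise
manual = pd.DataFrame(
    {value: (colors == value).astype(int) for value in sorted(colors.unique())}
)

print(manual)
```

Each row ends up with exactly one 1, in the column that matches its original value.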

What are some potential drawbacks of one-hot encoding?

One-hot encoding can be very helpful when working with categorical variables. One major drawback, however, is that it adds a new column for every unique value, which can significantly increase the size of your data. Because of this, it’s a poor fit for columns with a large number of distinct categories.
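
As a rough illustration of how quickly the column count can grow, the sketch below encodes a made-up dataframe (df_wide, with hypothetical City and Plan columns) and compares the number of unique values with the resulting shape.

```python
import pandas as pd

# Hypothetical data: 'City' has many distinct values, 'Plan' has only two
df_wide = pd.DataFrame({
    'City': [f'City {i}' for i in range(1000)],
    'Plan': ['Free', 'Pro'] * 500,
})

# Number of unique values per column
print(df_wide.nunique())

# Each text column becomes one dummy column per unique value
encoded = pd.get_dummies(df_wide)
print(encoded.shape)
```

Here two original columns turn into 1,002 columns, almost all of them coming from the high-cardinality City column.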

Loading our dataset

Let’s begin this tutorial by loading our required libraries and creating a dataset we can use throughout the tutorial.

import pandas as pd

df = pd.DataFrame.from_dict(
    {
        'Name': ['Joan', 'Matt', 'Jeff', 'Melissa', 'Devi'],
        'Gender': ['Female', 'Male', 'Male', 'Female', 'Female'],
        'House Type': ['Apartment', 'Detached', 'Apartment', 'Semi-Detached', 'Semi-Detached']
    }
)

print(df)

Printing our dataframe returns:

      Name  Gender     House Type
0     Joan  Female      Apartment
1     Matt    Male       Detached
2     Jeff    Male      Apartment
3  Melissa  Female  Semi-Detached
4     Devi  Female  Semi-Detached

How to use the Pandas get dummies function

Calling the Pandas get_dummies() function on a column returns a new dataframe in which each unique value of that column becomes its own dummy variable column. Let’s see how this works in action:

# dtype=int gives 1/0 columns; pandas 2.0+ returns True/False by default
dummy_gender = pd.get_dummies(df['Gender'], dtype=int)

print(dummy_gender)

This returns the following dataframe:

   Female  Male
0       1     0
1       0     1
2       0     1
3       1     0
4       1     0

This is really helpful, but it unfortunately doesn’t include the other columns. For that to happen, we need to merge it back into the previous dataframe.

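# Store the result under a new name so that df stays unchanged for later examples
df_with_dummies = pd.merge(
    left=df,
    right=dummy_gender,
    left_index=True,
    right_index=True,
)

print(df_with_dummies)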

This returns the following:

      Name  Gender     House Type  Female  Male
0     Joan  Female      Apartment       1     0
1     Matt    Male       Detached       0     1
2     Jeff    Male      Apartment       0     1
3  Melissa  Female  Semi-Detached       1     0
4     Devi  Female  Semi-Detached       1     0
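
Because the dummy dataframe shares its index with the original, the same result can likely be reached with less typing. The sketch below, on a shortened copy of the data, uses DataFrame.join() and pd.concat() as alternatives to the merge above; treat it as one possible approach rather than the tutorial’s method.

```python
import pandas as pd

# Shortened copy of the tutorial's data, for illustration only
df = pd.DataFrame({
    'Name': ['Joan', 'Matt', 'Jeff'],
    'Gender': ['Female', 'Male', 'Male'],
})

# dtype=int keeps the dummy columns as 1/0
dummy_gender = pd.get_dummies(df['Gender'], dtype=int)

# Both calls align rows on the shared index, like the merge above
joined = df.join(dummy_gender)
concatenated = pd.concat([df, dummy_gender], axis=1)

print(joined)
```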

Adding a Prefix to the One-hot Encoded Columns

Once you start one-hot encoding multiple columns, it can be hard to tell which original column each dummy column came from. The prefix= parameter solves this by adding a label to the front of each new column name. By default, the prefix and the value are separated by an underscore (_), which is controlled by the separate prefix_sep= parameter.

We’ll use the prefix Gender:

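# The underscore is added automatically, so the prefix itself doesn't need one
dummy_gender = pd.get_dummies(df['Gender'], prefix='Gender', dtype=int)

# Store the result under a new name so that df stays unchanged for later examples
df_prefixed = pd.merge(
    left=df,
    right=dummy_gender,
    left_index=True,
    right_index=True,
)

print(df_prefixed)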

This returns the following dataframe:

      Name  Gender     House Type  Gender_Female  Gender_Male
0     Joan  Female      Apartment              1            0
1     Matt    Male       Detached              0            1
2     Jeff    Male      Apartment              0            1
3  Melissa  Female  Semi-Detached              1            0
4     Devi  Female  Semi-Detached              1            0
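
The separator itself can be changed with the prefix_sep= parameter of get_dummies(). The short sketch below, using a small made-up Series, compares the default underscore with a hyphen.

```python
import pandas as pd

# Small illustrative Series (not the full tutorial dataset)
genders = pd.Series(['Female', 'Male', 'Male'])

# Default separator: an underscore between prefix and value
default_cols = pd.get_dummies(genders, prefix='Gender').columns.tolist()

# Custom separator via prefix_sep=
custom_cols = pd.get_dummies(genders, prefix='Gender', prefix_sep='-').columns.tolist()

print(default_cols)
print(custom_cols)
```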

How to one-hot encode a Pandas dataframe

Building on the example above, let’s take a look at how we can one-hot encode all of our dataframe’s categorical columns.

Using the code below, we loop over each categorical column, create its dummy columns, merge them into the dataframe, and finally drop the original column to avoid redundancy.

categorical_columns = ['Gender', 'House Type']

for column in categorical_columns:
    tempdf = pd.get_dummies(df[column], prefix=column, dtype=int)

    df = pd.merge(
        left=df,
        right=tempdf,
        left_index=True,
        right_index=True,
    )

    df = df.drop(columns=column)

print(df)

This returns the following:

      Name  Gender_Female  Gender_Male  House Type_Apartment  House Type_Detached  House Type_Semi-Detached
0     Joan              1            0                     1                    0                         0
1     Matt              0            1                     0                    1                         0
2     Jeff              0            1                     1                    0                         0
3  Melissa              1            0                     0                    0                         1
4     Devi              1            0                     0                    0                         1
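
The loop can also be replaced by a single call, since get_dummies() accepts a whole dataframe along with a columns= parameter. The sketch below assumes the original three-column dataframe and should produce the same column names as the loop, with each original column name used as the prefix.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Joan', 'Matt', 'Jeff', 'Melissa', 'Devi'],
    'Gender': ['Female', 'Male', 'Male', 'Female', 'Female'],
    'House Type': ['Apartment', 'Detached', 'Apartment', 'Semi-Detached', 'Semi-Detached'],
})

# Encode only the listed columns; 'Name' is passed through untouched.
# dtype=int keeps the dummy columns as 1/0.
encoded = pd.get_dummies(df, columns=['Gender', 'House Type'], dtype=int)

print(encoded.columns.tolist())
```

Listing the columns explicitly matters here: without columns=, every text column, including Name, would be encoded.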

Conclusion

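In this post, you learned what one-hot encoding is, what its main drawback is, and how to generate dummy variables with the Pandas get_dummies() function, including adding a prefix and encoding multiple columns.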

To learn more about the pandas get_dummies() function, check out the official documentation.