Pandas get dummies (pd.get_dummies()) allows you to easily one-hot encode your categorical data.
Table of Contents:
- What is one-hot encoding?
- Loading our dataset
- How to use the Pandas get_dummies() function
- How to one-hot encode a Pandas dataframe
- Conclusion
What is one-hot encoding?
One-hot encoding is an important step for preparing your dataset for use in machine learning. One-hot encoding turns your categorical data into a binary vector representation. Pandas get dummies makes this very easy!
This means that for each unique value in a column, a new column is created. The values in this column are represented as 1s and 0s, depending on whether the value matches the column header.
See the image below for a visual representation of what happens:

This can be really helpful for machine learning techniques that require numerical input and can’t work with text labels directly.
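To make the idea concrete, here is a minimal sketch of one-hot encoding done by hand, without get_dummies(). The colours series is a hypothetical example and not part of this tutorial's dataset; the sketch simply compares the column against each of its unique values.

```python
import pandas as pd

# Hypothetical example data (not the tutorial's dataset)
colours = pd.Series(['Red', 'Green', 'Red', 'Blue'], name='Colour')

# Build one 0/1 column per unique value, in sorted order
encoded = pd.DataFrame(
    {value: (colours == value).astype(int) for value in sorted(colours.unique())}
)

print(encoded)
```

Each row ends up with exactly one 1, which is where the name "one-hot" comes from.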
What are some potential drawbacks of one-hot encoding?
One-hot encoding can be very helpful when working with categorical variables. One major drawback, however, is that it creates significantly more data, because every unique value becomes its own column. Because of this, it shouldn’t be used on columns with a very large number of categories.
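One way to guard against this is to count the unique values in each column before encoding it. The sketch below uses DataFrame.nunique(); the example data and the MAX_CATEGORIES threshold are assumptions chosen for illustration, not a recommended cutoff.

```python
import pandas as pd

# Hypothetical data: 'City' has many distinct values, 'Gender' has few
people = pd.DataFrame(
    {
        'City': ['Toronto', 'Ottawa', 'Halifax', 'Calgary', 'Regina'],
        'Gender': ['Female', 'Male', 'Male', 'Female', 'Female'],
    }
)

MAX_CATEGORIES = 3  # illustrative threshold, adjust for your own data

# Count unique values per column; each one would become a new column
cardinality = people.nunique()
print(cardinality)

# Keep only the columns that stay at or under the threshold
columns_to_encode = cardinality[cardinality <= MAX_CATEGORIES].index.tolist()
print(columns_to_encode)
```

Here only Gender would be encoded, since City would add one new column for every row.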
Loading our dataset
Let’s begin this tutorial by loading our required libraries and creating a dataset we can use throughout the tutorial.
import pandas as pd

df = pd.DataFrame.from_dict(
    {
        'Name': ['Joan', 'Matt', 'Jeff', 'Melissa', 'Devi'],
        'Gender': ['Female', 'Male', 'Male', 'Female', 'Female'],
        'House Type': ['Apartment', 'Detached', 'Apartment', 'Semi-Detached', 'Semi-Detached']
    }
)

print(df)
Printing our dataframe returns:
      Name  Gender     House Type
0     Joan  Female      Apartment
1     Matt    Male       Detached
2     Jeff    Male      Apartment
3  Melissa  Female  Semi-Detached
4     Devi  Female  Semi-Detached
How to use the Pandas get_dummies() function
Calling the Pandas get_dummies() function on a column returns a dataframe with that column’s values converted into dummy variables. Let’s see how this works in action:
# dtype=int gives the 1/0 output shown below; pandas 2.0+ defaults to True/False
dummy_gender = pd.get_dummies(df['Gender'], dtype=int)
print(dummy_gender)
This returns the following dataframe:
   Female  Male
0       1     0
1       0     1
2       0     1
3       1     0
4       1     0
This is really helpful, but it unfortunately doesn’t include the other columns. For that to happen, we need to merge it back into the previous dataframe.
df = pd.merge(
    left=df,
    right=dummy_gender,
    left_index=True,
    right_index=True,
)
print(df)
This returns the following:
      Name  Gender     House Type  Female  Male
0     Joan  Female      Apartment       1     0
1     Matt    Male       Detached       0     1
2     Jeff    Male      Apartment       0     1
3  Melissa  Female  Semi-Detached       1     0
4     Devi  Female  Semi-Detached       1     0
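Because the dummy dataframe shares its index with the original, the merge above can also be written with DataFrame.join(), which aligns on the index by default. This is a minimal sketch of that alternative rather than a replacement for the approach used in this tutorial; the joined variable name is only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Name': ['Joan', 'Matt', 'Jeff', 'Melissa', 'Devi'],
        'Gender': ['Female', 'Male', 'Male', 'Female', 'Female'],
    }
)

# dtype=int keeps the dummy columns as 1/0 rather than True/False
dummy_gender = pd.get_dummies(df['Gender'], dtype=int)

# join() matches rows on the index, so no key arguments are needed
joined = df.join(dummy_gender)
print(joined)
```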
Adding a Prefix to the One-hot Encoded Columns
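Once you start one-hot encoding multiple columns, it can get a little confusing to tell which new column came from which original column. The prefix= parameter solves this by adding a label to the front of each new column name. By default, the prefix is separated from the value by an underscore (_), so you don’t need to include one yourself.
Starting again from the original dataframe, we’ll include the prefix Gender:
# Pass the prefix without a trailing underscore; get_dummies() adds the separator
dummy_gender = pd.get_dummies(df['Gender'], prefix='Gender', dtype=int)
df = pd.merge(
    left=df,
    right=dummy_gender,
    left_index=True,
    right_index=True,
)
print(df)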
This returns the following dataframe:
      Name  Gender     House Type  Gender_Female  Gender_Male
0     Joan  Female      Apartment              1            0
1     Matt    Male       Detached              0            1
2     Jeff    Male      Apartment              0            1
3  Melissa  Female  Semi-Detached              1            0
4     Devi  Female  Semi-Detached              1            0
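If an underscore isn't the separator you want, get_dummies() also takes a prefix_sep= parameter. The short sketch below compares the default with a custom separator; the genders series and the choice of '.' as a separator are just for illustration.

```python
import pandas as pd

genders = pd.Series(['Female', 'Male', 'Male', 'Female', 'Female'], name='Gender')

# Default: an underscore is inserted between the prefix and the value
default_sep = pd.get_dummies(genders, prefix='Gender', dtype=int)
print(list(default_sep.columns))

# Custom separator supplied through prefix_sep=
custom_sep = pd.get_dummies(genders, prefix='Gender', prefix_sep='.', dtype=int)
print(list(custom_sep.columns))
```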
How to one-hot encode a Pandas dataframe
Following on from the example above, let’s take a look at how we can one-hot encode all of our dataframe’s categorical columns.
Using the code below, we start again from the original dataframe, loop over the categorical columns, merge each set of dummy columns into the dataframe, and finally drop the original column to reduce redundancy.
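categorical_columns = ['Gender', 'House Type']

for column in categorical_columns:
    # One-hot encode the column, using its name as the prefix
    tempdf = pd.get_dummies(df[column], prefix=column, dtype=int)
    # Merge the dummy columns back in on the index
    df = pd.merge(
        left=df,
        right=tempdf,
        left_index=True,
        right_index=True,
    )
    # Drop the original column now that it has been encoded
    df = df.drop(columns=column)

print(df)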
This returns the following:
      Name  Gender_Female  Gender_Male  House Type_Apartment  House Type_Detached  House Type_Semi-Detached
0     Joan              1            0                     1                    0                         0
1     Matt              0            1                     0                    1                         0
2     Jeff              0            1                     1                    0                         0
3  Melissa              1            0                     0                    0                         1
4     Devi              1            0                     0                    0                         1
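The loop above can also be collapsed into a single call, since get_dummies() accepts a whole dataframe together with a columns= list. The sketch below rebuilds the tutorial's dataframe so it runs on its own; the encoded variable name is only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Name': ['Joan', 'Matt', 'Jeff', 'Melissa', 'Devi'],
        'Gender': ['Female', 'Male', 'Male', 'Female', 'Female'],
        'House Type': ['Apartment', 'Detached', 'Apartment', 'Semi-Detached', 'Semi-Detached'],
    }
)

# Encode only the listed columns; each column name is used as its prefix
# and the original columns are dropped automatically
encoded = pd.get_dummies(df, columns=['Gender', 'House Type'], dtype=int)
print(encoded)
```

Columns that aren't listed, such as Name, are passed through unchanged.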
Conclusion
In this post, you learned how to generate dummy variables and what one-hot encoding is.
To learn more about the pandas get_dummies() function, check out the official documentation.