There are many times when you may need to set a Pandas column value based on the condition of another column. In this post, you’ll learn all the different ways in which you can create Pandas conditional columns.
Table of Contents
Video Tutorial
If you prefer to follow along with a video tutorial, check out my video below:
Loading a Sample Dataframe
Let’s begin by loading a sample Pandas dataframe that we can use throughout this tutorial.
We’ll begin by import pandas and loading a dataframe using the .from_dict()
method:
import pandas as pd
df = pd.DataFrame.from_dict(
{
'Name': ['Jane', 'Melissa', 'John', 'Matt'],
'Age': [23, 45, 35, 64],
'Birth City': ['London', 'Paris', 'Toronto', 'Atlanta'],
'Gender': ['F', 'F', 'M', 'M']
}
)
print(df)
This returns the following dataframe:
Name Age Birth City Gender
0 Jane 23 London F
1 Melissa 45 Paris F
2 John 35 Toronto M
3 Matt 64 Atlanta M
Using Pandas loc to Set Pandas Conditional Column
Pandas loc is incredibly powerful! If you need a refresher on loc (or iloc), check out my tutorial here. Pandas’ loc creates a boolean mask, based on a condition. Sometimes, that condition can just be selecting rows and columns, but it can also be used to filter dataframes. These filtered dataframes can then have values applied to them.
Let’s explore the syntax a little bit:
df.loc[df[‘column’] condition, ‘new column name’] = ‘value if condition is met’
With the syntax above, we filter the dataframe using .loc
and then assign a value to any row in the column (or columns) where the condition is met.
Let’s try this out by assigning the string ‘Under 30’ to anyone with an age less than 30, and ‘Over 30’ to anyone 30 or older.
df['Age Category'] = 'Over 30'
df.loc[df['Age'] < 30, 'Age Category'] = 'Under 30'
Let's take a look at what we did here:
- We assigned the string 'Over 30' to every record in the dataframe. To learn more about this, check out my post here or creating new columns.
- We then use
.loc
to create a boolean mask on the Age column to filter down to rows where the age is less than 30. When this condition is met, the Age Category column is assigned the new value 'Under 30'
But what happens when you have multiple conditions? You could, of course, use .loc
multiple times, but this is difficult to read and fairly unpleasant to write. Let's see how we can accomplish this using numpy's .select()
method.
Using Numpy Select to Set Values using Multiple Conditions
Similar to the method above to use .loc
to create a conditional column in Pandas, we can use the numpy .select()
method.
Let's begin by importing numpy and we'll give it the conventional alias np
:
import numpy as np
Now, say we wanted to apply a number of different age groups, as below:
- <20 years old,
- 20-39 years old,
- 40-59 years old,
- 60+ years old
In order to do this, we'll create a list of conditions and corresponding values to fill:
conditions = [
(df['Age'] < 20),
(df['Age'] >= 20) & (df['Age'] < 40),
(df['Age'] >= 40) & (df['Age'] < 59),
(df['Age'] >= 60)
]
values = ['<20 years old', '20-39 years old', '40-59 years old', '60+ years old']
df['Age Group'] = np.select(conditions, values)
print(df)
Running this returns the following dataframe:
Name Age Birth City Gender Age Group
0 Jane 23 London F 20-39 years old
1 Melissa 45 Paris F 40-59 years old
2 John 35 Toronto M 20-39 years old
3 Matt 64 Atlanta M 60+ years old
Let's break down what happens here:
- We first define a list of conditions in which the criteria are specified. Recall that lists are ordered meaning that they should be in the order in which you would like the corresponding values to appear.
- We then define a list of values to use, which corresponds to the values you'd like applied in your new column.
Something to consider here is that this can be a bit counterintuitive to write. You can similarly define a function to apply different values. We'll cover this off in the section of using the Pandas .apply() method below.
One of the key benefits is that using numpy as is very fast, especially when compared to using the .apply()
method.
Using Pandas Map to Set Values in Another Column
The Pandas .map()
method is very helpful when you're applying labels to another column. In order to use this method, you define a dictionary to apply to the column.
For our sample dataframe, let's imagine that we have offices in America, Canada, and France. We want to map the cities to their corresponding countries and apply and "Other" value for any other city.
city_dict = {
'Paris': 'France',
'Toronto': 'Canada',
'Atlanta': 'USA'
}
df['Country'] = df['Birth City'].map(city_dict)
print(df)
When we print this out, we get the following dataframe returned:
Name Age Birth City Gender Country
0 Jane 23 London F NaN
1 Melissa 45 Paris F France
2 John 35 Toronto M Canada
3 Matt 64 Atlanta M USA
What we can see here, is that there is a NaN
value associated with any City that doesn't have a corresponding country. If we want to apply "Other" to any missing values, we can chain the .fillna()
method:
city_dict = {
'Paris': 'France',
'Toronto': 'Canada',
'Atlanta': 'USA'
}
df['Country'] = df['Birth City'].map(city_dict).fillna('Other')
print(df)
This returns the following dataframe:
Name Age Birth City Gender Country
0 Jane 23 London F Other
1 Melissa 45 Paris F France
2 John 35 Toronto M Canada
3 Matt 64 Atlanta M USA
Using Pandas Apply to Apply a function to a column
Finally, you can apply built-in or custom functions to a dataframe using the Pandas .apply()
method.
Let's take a look at both applying built-in functions such as len()
and even applying custom functions.
Applying Python Built-in Functions to a Column
We can easily apply a built-in function using the .apply()
method. Let's see how we can use the len()
function to count how long a string of a given column.
df['Name Length'] = df['Name'].apply(len)
print(df)
This returns the following dataframe:
Name Age Birth City Gender Name Length
0 Jane 23 London F 4
1 Melissa 45 Paris F 7
2 John 35 Toronto M 4
3 Matt 64 Atlanta M 4
Take note of a few things here:
- We apply the
.apply()
method to a particular column, - We omit the parentheses "()"
Using Third-Party Packages in Pandas Apply
Similarly, you can use functions from using packages. Let's use numpy to apply the .sqrt()
method to find the scare root of a person's age.
import numpy as np
df['Age Squareroot'] = df['Age'].apply(np.sqrt)
print(df)
This returns the following dataframe:
Name Age Birth City Gender Age Squareroot
0 Jane 23 London F 4.795832
1 Melissa 45 Paris F 6.708204
2 John 35 Toronto M 5.916080
3 Matt 64 Atlanta M 8.000000
Using Custom Functions with Pandas Apply
Something that makes the .apply()
method extremely powerful is the ability to define and apply your own functions.
Let's revisit how we could use an if-else statement to create age categories as in our earlier example:
def age_groups(x):
if x < 20:
return '<20 years old'
elif x < 40:
return '20-39 years old'
elif x < 60:
return '40-59 years old'
else:
return '60+ years old'
df['Age Group'] = df['Age'].apply(age_groups)
print(df)
This returns the following dataframe:
Name Age Birth City Gender Age Group
0 Jane 23 London F 20-39 years old
1 Melissa 45 Paris F 40-59 years old
2 John 35 Toronto M 20-39 years old
3 Matt 64 Atlanta M 60+ years old
Conclusion
In this post, you learned a number of ways in which you can apply values to a dataframe column to create a Pandas conditional column, including using .loc
, .np.select()
, Pandas .map()
and Pandas .apply()
. Each of these methods has a different use case that we explored throughout this post.
Learn more about Pandas methods covered here by checking out their official documentation:
Thank you so much! Brilliantly explained!!!
Thanks Aisha!