Pandas isin makes it easy to emulate the SQL IN
and NOT IN
operators to filter your dataframe using the Pandas .isin()
method. In this post, you’ll learn how the .isin()
method works, how to filter a single column, how to filter multiple columns, and how to filter based on conditions not being true.
Table of Contents
Loading a Sample Dataframe
To follow along with this tutorial, feel free to load the same dataframe below. Simply copy the code below into your favourite code editor:
import pandas as pd
df = pd.DataFrame.from_dict(
{
'Name': ['Joan', 'Devi', 'Melissa', 'Dave'],
'Age':[19, 43, 27, 32],
'Gender': ['Female', 'Female', 'Female', 'Male'],
'Education': ['High School', 'College', 'PhD', 'High School'],
'City': ['Atlanta', 'Toronto', 'New York City', 'Madrid']
}
)
print(df)
This returns the following dataframe:
Name Age Gender Education City
0 Joan 19 Female High School Atlanta
1 Devi 43 Female College Toronto
2 Melissa 27 Female PhD New York City
3 Dave 32 Male High School Madrid
Want to learn more? If you want to learn other ways of filtering a Pandas Dataframe, check out my post on filtering data with Pandas here.
Pandas Isin Syntax
Let’s explore the syntax for the .isin()
method before diving into some examples:
DataFrame.isin(values)
The function takes a single parameter values
, where you can pass in an iterable, a Series, a DataFrame or a dictionary. Whatever you pass into the values
parameter is run against a vectorized boolean expression (meaning it’s fast!) and filters your dataframe.
There are some particular notes to keep in mind when you want to filter your dataframe:
- If you pass in a dictionary, the keys must match the column names for which you want to filter the data.
- If you pass in a dataframe, both the columns and the index must match for which you want to filter the data.
Filtering a Single Column with Pandas Isin
Let’s begin filter our dataframe by first focusing on a single column. The .isin()
method is very helpful when you’re trying to filter on multiple options.
For example, let’s say you wanted to filter your data to include people with an Education of either College or PhD. To accomplish this, you could write:
df = df[df['Education'].isin(['College', 'PhD'])]
print(df)
This returns the following dataframe:
Name Age Gender Education City
1 Devi 43 Female College Toronto
2 Melissa 27 Female PhD New York City
Let’s get a better understanding of what’s actually going on here. Running just df['Education'].isin(['College', 'PhD'])
actually returns a boolean array that looks like this:
0 False
1 True
2 True
3 False
That array is then applied to the dataframe df
to filter the dataframe more easily. This allows us to apply the filtering in a vectorized format, allowing it to be much, much faster.
Now let’s take a look at filtering on multiple columns with the .isin()
method.
Filtering Multiple Columns with Pandas Isin
Filtering on multiple columns adds a little bit of complexity to the equation, but it’s nothing you can’t handle! Let’s look at an example first and then break down what’s going on.
In the code below, we want to filter rows where any of the following conditions are met:
- Education is college
- Gender is female
df = df[df[['Education', 'Gender']].isin(['College', 'Female']).any(axis=1)]
print(df)
What we do is we look at the two columns Education
and Gender
by wrapping them in double-square bracket. We then apply the .isin()
method as before. However, we must chain it with the .any()
method in order to filter the dataframe down.
This in essence acts as an or
operator and returns the following dataframe:
Name Age Gender Education City
0 Joan 19 Female High School Atlanta
1 Devi 43 Female College Toronto
2 Melissa 27 Female PhD New York City
If we wanted to apply an and
condition, we would simply use chain the .all()
method, rather than the .any()
:
df = df[df[['Education', 'Gender']].isin(['College', 'Female']).all(axis=1)]
print(df)
This returns the following dataframe:
Name Age Gender Education City
1 Devi 43 Female College Toronto
Filtering an Entire Dataframe with Pandas Isin
Similarly, you can use the .isin()
method to filter data in an entire dataframe. Rather than passing it onto a series, we pass the method onto the dataframe itself.
Let’s explore this with another example. Here we want to filter down to any of the following criteria:
- Gender is male
- Education is college
df = df[df.isin(['College', 'Male']).any(axis=1)]
print(df)
Note, that we’re using the .any()
method again to ensure that any of the conditions can be met. If we did not do this, we’d need to match criteria in every column.
This method can be helpful if you don’t want to specify the columns in which your data can be found (or you just don’t know where it is).
This returns the following dataframe:
Name Age Gender Education City
1 Devi 43 Female College Toronto
3 Dave 32 Male High School Madrid
Filtering Using Pandas Isin Not Matching Condition
Similar to comparing the .isin()
method to SQL’s IN
statement, we can use the Pandas unary operator (~
) to perform a NOT IN
selection.
Let’s take a look at how we can accomplish this using the Pandas .isin()
method on a single column.
Here we want to select any record where education is not equal to ‘College’ or ‘PhD’:
df = df[~df['Education'].isin(['College', 'PhD'])]
print(df)
This returns the following dataframe:
Name Age Gender Education City
0 Joan 19 Female High School Atlanta
3 Dave 32 Male High School Madrid
The same use of the unary operator applies to all other sections of this tutorial as well, in terms of filtering on multiple columns, or the entire dataframe.
Conclusion
In this post, you learned how to emulate the SQL IN
and NOT IN
operators to filter a dataframe using the .isin()
method. You learned how to filter a single column, multiple columns, entire dataframes, and filtering items not matching a condition. If you want to learn more about the .isin()
method, check out the official documentation here.
might want to mention that `df = df[df[[‘Col1’, ‘Col2’]].isin([2, 3]).all(axis=1)]` wont care if 2 is in col1 or col2 andthe same for 3
import pandas as pd
df = pd.DataFrame.from_dict(
{
‘Name’: [‘Joan’, ‘Devi’, ‘Melissa’, ‘Dave’],
‘Col1’: [1, 2, 3, 4],
‘Col2’: [4, 3, 2, 1],
}
)
# print(df)
df = df[df[[‘Col1’, ‘Col2’]].isin([2, 3]).all(axis=1)]
print(df)
this doesnt return the result where col1 is 2 and col2 is 3 per se
Name Col1 Col2
1 Devi 2 3
2 Melissa 3 2
(just something im stuck with at the moment and ended up on this page trying to find an answer (where multiple columns match diff dataframes))
Thanks, William! Great point!