Pandas is the essential data analysis library in Python. Being able to use the library to filter data in meaningful ways will make you a stronger programmer. In this tutorial, you’ll learn how to use the Pandas query function to filter a DataFrame in plain English. One of the many perks of the function is the ability to use SQL-like filter statements to filter your dataset.
By the end of this tutorial, you’ll have learned:
- How to use the Pandas query() function to filter a DataFrame
- How to use the Pandas query function to filter one or more columns
- How to compare multiple columns using Pandas query
- How to filter a Pandas DataFrame in place
Understanding the Pandas Query Function
Let’s dive into exploring the Pandas query() function to better understand the parameters and default arguments that the function provides.
# Understanding the Pandas query() Function
import pandas as pd
DataFrame.query(expr, inplace=False, **kwargs)
We can see that the Pandas query() function has two parameters:
- expr= represents the expression to use to filter the DataFrame
- inplace= instructs Pandas to filter the DataFrame in place and defaults to False
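As a quick sketch of how these two parameters behave, the example below uses a small, made-up DataFrame (the demo data here is purely illustrative):
# A quick sketch of expr= and inplace= on a small, illustrative DataFrame
import pandas as pd
demo = pd.DataFrame({'Units': [2, 5, 9], 'Sales': [20, 50, 90]})
# expr= holds the filter expression; a new, filtered DataFrame is returned
print(demo.query('Units > 3'))
# inplace=True filters the original DataFrame itself and returns None
demo.query('Units > 3', inplace=True)
print(demo)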
Now that you have a strong understanding of the function, let’s dive into using it to filter data.
Loading a Sample Pandas DataFrame
To follow along with this tutorial, let’s load a sample Pandas DataFrame. We can load the DataFrame from the file hosted on my GitHub page, using the pd.read_excel() function. Then we can print out the first five records of the dataset using the .head() method.
# Loading a Sample Pandas DataFrame
import pandas as pd
df = pd.read_excel(
'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx',
parse_dates=['Date']
)
print(df.head())
# Returns:
# Date Region Type Units Sales
# 0 2020-07-11 East Children's Clothing 18.0 306
# 1 2020-09-23 North Children's Clothing 14.0 448
# 2 2020-04-02 South Women's Clothing 17.0 425
# 3 2020-02-28 East Children's Clothing 26.0 832
# 4 2020-03-19 West Women's Clothing 3.0 33
Based on the output of the first five rows shown above, we can see that we have five columns to work with:
- Date is a date-formatted column
- Region and Type are both strings
- Units and Sales are numeric
Now that you have a good understanding of the DataFrame, let’s dive into how to use the Pandas query function.
Using Pandas Query to Filter a DataFrame
The Pandas .query() method lets you filter a DataFrame using SQL-like, plain-English statements. You pass in a string that represents a filter expression, and the method keeps only the rows for which that expression evaluates to True. The syntax can feel a little awkward at first, but if you’re familiar with SQL, the format will feel very natural. Let’s take a look at an example where we filter the DataFrame to show only rows where Units are less than 4.
import pandas as pd
df = pd.read_excel(
'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx',
parse_dates=['Date']
)
filtered = df.query('Units < 4')
print(filtered.head())
# Returns:
# Date Region Type Units Sales
# 4 2020-03-19 West Women's Clothing 3.0 33
# 28 2020-01-19 East Men's Clothing 3.0 63
# 96 2020-11-13 East Children's Clothing 3.0 72
# 118 2020-12-28 East Children's Clothing 3.0 78
# 134 2020-09-04 North Children's Clothing 3.0 184
We filtered the DataFrame by passing in a string that represents the expression, in this case 'Units < 4'. In this example, we filtered a numeric column. However, we can also filter string columns. Let’s take a look at an example:
import pandas as pd
df = pd.read_excel(
'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx',
parse_dates=['Date']
)
filtered = df.query('Region == "West"')
print(filtered.head())
# Returns:
# Date Region Type Units Sales
# 4 2020-03-19 West Women's Clothing 3.0 33
# 15 2020-11-26 West Men's Clothing 27.0 864
# 21 2020-06-23 West Women's Clothing 18.0 288
# 24 2020-06-18 West Men's Clothing 5.0 70
# 30 2020-07-13 West Children's Clothing 30.0 450
In the example above, we filtered the DataFrame to only show records where the Region is equal to West. Because the value is a string, we need to wrap it in an extra set of quotes. It’s important to use non-matching quotation marks, so as to not accidentally close the surrounding string.
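If you prefer, the quoting can also be flipped: double quotes around the whole expression and single quotes around the string value. A quick sketch, assuming the same df loaded above:
# Flipping the quotes: double quotes outside, single quotes inside
filtered = df.query("Region == 'West'")
print(filtered.head())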
In the following section, you’ll learn how to use the Pandas query method with multiple conditions.
Using Pandas Query with Multiple Conditions
The Pandas query method can also be used to filter with multiple conditions. This allows us to combine conditions using the logical operators and and or. By using multiple conditions, we can write powerful statements that filter based on single or multiple columns.
Let’s see how we can use the method to filter data based on the Region and Units column.
import pandas as pd
df = pd.read_excel(
'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx',
parse_dates=['Date']
)
filtered = df.query('Region == "West" and Units < 4')
print(filtered.head())
# Returns:
# Date Region Type Units Sales
# 4 2020-03-19 West Women's Clothing 3.0 33
# 135 2020-01-07 West Women's Clothing 3.0 350
# 355 2020-06-12 West Children's Clothing 3.0 567
# 558 2020-10-06 West Men's Clothing 3.0 462
# 686 2020-02-18 West Men's Clothing 3.0 918
In the example above, we pass in two conditions: one based on a numeric column and another based on a string column. We use the and operator to make sure that both conditions are met.
Similarly, we can modify the expression to use the or operator to make sure that either of the conditions is met:
import pandas as pd
df = pd.read_excel(
'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx',
parse_dates=['Date']
)
filtered = df.query('Region == "West" or Units < 4')
print(filtered.head())
# Returns:
# Date Region Type Units Sales
# 4 2020-03-19 West Women's Clothing 3.0 33
# 15 2020-11-26 West Men's Clothing 27.0 864
# 21 2020-06-23 West Women's Clothing 18.0 288
# 24 2020-06-18 West Men's Clothing 5.0 70
# 28 2020-01-19 East Men's Clothing 3.0 63
In the example above, we repeat our previous filter but use the or operator instead. This means that the data are filtered to records where either the Region is equal to West or the Units are less than 4.
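As a side note, the query method also accepts the & and | operators in place of and and or. A quick sketch of the earlier filter, assuming the same df:
# Using & instead of the and keyword produces the same result
filtered = df.query('Region == "West" & Units < 4')
print(filtered.head())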
Using Variables in Pandas Query
In order to use a variable in the Pandas query method, you can preface the variable with an @ symbol. This substitutes the variable’s value into the filter expression.
In the example below, we’ll declare a variable value and assign it the value 25. Let’s see how we can use that variable in our query expression:
import pandas as pd
df = pd.read_excel(
'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx',
parse_dates=['Date'])
value = 25
filtered = df.query("Units > @value")
print(filtered.head())
# Returns:
# Date Region Type Units Sales
# 3 2020-02-28 East Children's Clothing 26.0 832
# 5 2020-02-05 North Women's Clothing 33.0 627
# 7 2020-03-25 East Women's Clothing 29.0 609
# 9 2020-11-03 East Children's Clothing 34.0 374
# 14 2020-01-07 East Men's Clothing 30.0 360
In the example above, we write our filter expression as 'Units > @value', where the @ sign references the variable that we defined earlier.
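The @ syntax isn’t limited to numbers. As a quick sketch (assuming the same df), you can reference a string variable in the same way:
# Referencing a string variable with the @ symbol
region = 'West'
filtered = df.query("Region == @region")
print(filtered.head())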
Using Pandas Query with Not Filter
Similarly, we can use the Pandas query method to create filter expressions where a column is not equal to a value. For this, we can use the not operator, which inverts the returned boolean expression. In the example below, we filter to any records where the Region is not equal to West.
import pandas as pd
df = pd.read_excel(
'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx',
parse_dates=['Date']
)
filtered = df.query('not Region == "West"')
print(filtered.head())
# Returns:
# Date Region Type Units Sales
# 0 2020-07-11 East Children's Clothing 18.0 306
# 1 2020-09-23 North Children's Clothing 14.0 448
# 2 2020-04-02 South Women's Clothing 17.0 425
# 3 2020-02-28 East Children's Clothing 26.0 832
# 5 2020-02-05 North Women's Clothing 33.0 627
The filter expression above filters to any records where Region is not equal to West. This is done by prepending the not operator to invert the selection.
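As an aside, the same result can be written with the != operator, which some readers may find more natural. A quick sketch, assuming the same df:
# An equivalent filter using the != operator instead of not
filtered = df.query('Region != "West"')
print(filtered.head())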
Using Pandas Query to Check if String Contains
We can also use the Pandas query method to check if a string contains a certain substring. This can be helpful when you want to filter a DataFrame based on a string. The syntax for this is a little less intuitive but works well in the query method. Let’s take a look at an example to filter our records to only show those where Region contains the substring 'st'.
import pandas as pd
df = pd.read_excel(
'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx',
parse_dates=['Date']
)
filtered = df.query("Region.str.contains('st')")
print(filtered.head())
# Returns:
# Date Region Type Units Sales
# 0 2020-07-11 East Children's Clothing 18.0 306
# 3 2020-02-28 East Children's Clothing 26.0 832
# 4 2020-03-19 West Women's Clothing 3.0 33
# 7 2020-03-25 East Women's Clothing 29.0 609
# 9 2020-11-03 East Children's Clothing 34.0 374
In the example above, we applied the .str.contains() method to the column inside the query string. In my view, this is less clear than applying the method to the column directly, but if you’re working with other query filters, it can be helpful to stick to the same approach.
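For comparison, here is a rough sketch of the same filter applied directly to the column with boolean indexing, assuming the same df loaded above:
# The same substring filter written with boolean indexing
contains_st = df[df['Region'].str.contains('st')]
print(contains_st.head())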
Using Functions in Pandas Query
One of the great features of the Pandas .query() method is that you can also filter based on values being passed into a function. Let’s see how we can use a function to filter a DataFrame using the query method.
import pandas as pd
df = pd.read_excel(
'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx',
parse_dates=['Date']
)
df['Sales'] = df['Sales'] * -1
filtered = df.query("abs(Sales) < 40")
print(filtered.head())
# Returns:
# Date Region Type Units Sales
# 4 2020-03-19 West Women's Clothing 3.0 -33
# 157 2020-01-17 East Men's Clothing 14.0 -36
# 164 2020-03-02 East Children's Clothing 31.0 -36
# 430 2020-02-28 East Men's Clothing 34.0 -39
# 901 2020-04-24 East Men's Clothing 28.0 -36
Let’s break down what the code block above is doing:
- We multiply the Sales column by -1 to convert it to negative values
- We then filter the DataFrame using the abs() function, which returns the absolute value of each number
- We filter the data to only include absolute values less than 40
In the following section, you’ll learn how to use the method to check if a value is in a list of values.
Using Pandas Query “in” to Check a List of Values
The Pandas query method makes it very easy to search for records that contain a value from a list of values. This is similar to using the Pandas isin method, which filters records based on membership in a list of values.
Let’s see how we can use Pandas to filter records that contain an item from a list:
import pandas as pd
df = pd.read_excel(
'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx',
parse_dates=['Date']
)
filtered = df.query("Region in ['West', 'East']")
print(filtered.head())
# Returns:
# Date Region Type Units Sales
# 0 2020-07-11 East Children's Clothing 18.0 306
# 3 2020-02-28 East Children's Clothing 26.0 832
# 4 2020-03-19 West Women's Clothing 3.0 33
# 7 2020-03-25 East Women's Clothing 29.0 609
# 9 2020-11-03 East Children's Clothing 34.0 374
In the example above, we checked whether or not the Region value was in the list of values ['West', 'East']. This is simpler than chaining multiple conditions and is much easier to read.
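For comparison, a rough sketch of the same filter written with the .isin() method mentioned above, assuming the same df:
# The same membership filter written with .isin() and boolean indexing
in_regions = df[df['Region'].isin(['West', 'East'])]
print(in_regions.head())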
Using Pandas Query to Filter Not In
Similar to the example above, we can intuitively filter data where items don’t match a given condition. This approach allows you to filter data based on plain English conditions, making it much simpler for readers of your code to understand what is happening. Let’s take a look at an example:
import pandas as pd
df = pd.read_excel(
'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx',
parse_dates=['Date']
)
filtered = df.query("Region not in ['West', 'East']")
print(filtered.head())
# Returns:
# Date Region Type Units Sales
# 1 2020-09-23 North Children's Clothing 14.0 448
# 2 2020-04-02 South Women's Clothing 17.0 425
# 5 2020-02-05 North Women's Clothing 33.0 627
# 6 2020-01-24 South Women's Clothing 12.0 396
# 8 2020-01-03 North Children's Clothing 18.0 486
In the example above, we use the not in operator to filter our DataFrame based on an inverted selection. This can be incredibly helpful when we don’t know which other values may be in a column, but we want to exclude records that meet a condition.
Using Pandas Query to Compare Two Columns
You can also use the Pandas query method to compare two columns in plain English. This allows you to use a simple statement to compare two columns, generally numeric columns. In the example below, we’ll create a new Sales2 column by adding a random value between -3 and 3 to the Sales column (since randint() is called only once, the same offset is applied to every row). We can then filter our records to only include those where Sales is larger than the new Sales2 column.
from random import randint
import pandas as pd
df = pd.read_excel(
'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx',
parse_dates=['Date']
)
df['Sales2'] = df['Sales'] + randint(-3, 3)
filtered = df.query("Sales > Sales2")
print(filtered.head())
# Returns:
# Date Region Type Units Sales Sales2
# 0 2020-07-11 East Children's Clothing 18.0 306 304
# 1 2020-09-23 North Children's Clothing 14.0 448 446
# 2 2020-04-02 South Women's Clothing 17.0 425 423
# 3 2020-02-28 East Children's Clothing 26.0 832 830
# 4 2020-03-19 West Women's Clothing 3.0 33 31
In the example above, we pass 'Sales > Sales2' into the query method, which returns only the records where the Sales column is larger than the Sales2 column.
Using Pandas Query to Filter Dates
We can also use the Pandas query method to easily filter a column based on dates. This allows you to use simple language to filter a column based on a date condition. You can use any of the comparison operators, such as >, <, ==, and more. Let’s take a look at how we can filter the Date column based on being after 2020-07-01:
import pandas as pd
df = pd.read_excel(
'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx',
parse_dates=['Date']
)
filtered = df.query("Date > '2020-07-01'")
print(filtered.head())
# Returns:
# Date Region Type Units Sales
# 0 2020-07-11 East Children's Clothing 18.0 306
# 1 2020-09-23 North Children's Clothing 14.0 448
# 9 2020-11-03 East Children's Clothing 34.0 374
# 11 2020-08-09 North Men's Clothing NaN 270
# 13 2020-08-11 East Children's Clothing 12.0 348
In the example above, we filter the Date column to records that fall after a given date. It’s important to pass the date in as a string; otherwise, the method won’t work.
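If you’d rather not hard-code the date string, one option is to combine this with the @ syntax covered earlier and reference a Timestamp variable. A quick sketch, assuming the same df:
# Referencing a date variable with @ instead of hard-coding the string
cutoff = pd.Timestamp('2020-07-01')
filtered = df.query("Date > @cutoff")
print(filtered.head())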
Using Pandas Query to Filter the DataFrame Index
We can also use the Pandas query method to filter a DataFrame based on the index column. In this case, we pass in index as we would any other column. Let’s see how we can filter our DataFrame based on the index value:
import pandas as pd
df = pd.read_excel(
'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx',
parse_dates=['Date']
)
filtered = df.query("index < 5")
print(filtered.head())
# Returns:
# Date Region Type Units Sales
# 0 2020-07-11 East Children's Clothing 18.0 306
# 1 2020-09-23 North Children's Clothing 14.0 448
# 2 2020-04-02 South Women's Clothing 17.0 425
# 3 2020-02-28 East Children's Clothing 26.0 832
# 4 2020-03-19 West Women's Clothing 3.0 33
In the example above, we filter the DataFrame based on the value in the index. In most cases, this will make more sense when we have a more meaningful index, but the example above illustrates the point.
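To illustrate what a more meaningful index might look like, here’s a rough sketch that sets Region as the index and then filters on it, assuming the same df:
# Setting a more meaningful index and filtering on it with query()
by_region = df.set_index('Region')
filtered = by_region.query('index == "West"')
print(filtered.head())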
Using Pandas Query to Filter a DataFrame In Place
In all of the examples above, we filtered the DataFrame by assigning the result to a new DataFrame, filtered. However, you can also filter the DataFrame in place. This gives you the benefit of not needing to reassign the data and can also be more memory efficient.
import pandas as pd
df = pd.read_excel(
'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx',
parse_dates=['Date']
)
df.query("Units < 4", inplace=True)
print(df.head())
# Returns:
# Date Region Type Units Sales
# 4 2020-03-19 West Women's Clothing 3.0 33
# 28 2020-01-19 East Men's Clothing 3.0 63
# 96 2020-11-13 East Children's Clothing 3.0 72
# 118 2020-12-28 East Children's Clothing 3.0 78
# 134 2020-09-04 North Children's Clothing 3.0 184
In the example above, we pass in the inplace=True argument to filter our data in place. Note that when inplace=True, the method modifies the original DataFrame and returns None, so there’s no need to assign the result.
Using Columns with Spaces in Pandas Query
While the Pandas query method seems to be able to handle most operations, it can struggle a little bit with columns that have spaces in them. However, we can also use columns with spaces, though they require a bit more work. Let’s assign a new column that contains spaces in the name and see how we can filter the data.
import pandas as pd
df = pd.read_excel(
'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx',
parse_dates=['Date']
)
df['Column With Spaces'] = df['Units']
filtered = df.query("`Column With Spaces` < 4")
print(filtered.head())
# Returns:
# Date Region Type Units Sales Column With Spaces
# 4 2020-03-19 West Women's Clothing 3.0 33 3.0
# 28 2020-01-19 East Men's Clothing 3.0 63 3.0
# 96 2020-11-13 East Children's Clothing 3.0 72 3.0
# 118 2020-12-28 East Children's Clothing 3.0 78 3.0
# 134 2020-09-04 North Children's Clothing 3.0 184 3.0
In order to filter a DataFrame column that has spaces in its name with the query method, we wrap the column name in backticks. While this may not look great, it does allow us to use any column in the method.
Conclusion
In this guide, you learned how to use the Pandas query method to filter a DataFrame using plain English statements. The method filters a DataFrame based on an expression that evaluates to True or False and allows for a ton of flexibility in how you filter your data.
Additional Resources
To learn more about related topics, check out the tutorials below: