Skip to content

How to Use Pandas Query to Filter a DataFrame

How to Use Pandas Query to Filter a DataFrame Cover Image

Pandas is the essential data analysis library in Python. Being able to use the library to filter data in meaningful ways will make you a stronger programmer. In this tutorial, you’ll learn how to use the Pandas query function to filter a DataFrame in plain English. One of the many perks of the function is the ability to use SQL-like filter statements to filter your dataset.

By the end of this tutorial, you’ll have learned:

  • How to use the Pandas query() function to filter a DataFrame
  • How to use the Pandas query function to filter one or more columns
  • How to compare multiple columns using Pandas query
  • How to filter a Pandas DataFrame in place

Understanding the Pandas Query Function

Let’s dive into exploring the Pandas query() function to better understand the parameters and default arguments that the function provides.

# Understanding the Pandas query() Function
import pandas as pd
DataFrame.query(expr, inplace=False, **kwargs)

We can see that the Pandas query() function has two parameters:

  1. expr= represents the expression to use to filter the DataFrame
  2. inplace= instructs Pandas to filter the DataFrame in place and defaults to False

Now that you have a strong understanding of the function, let’s dive into using it to filter data.

Loading a Sample Pandas DataFrame

To follow along with this tutorial, let’s load a sample Pandas DataFrame. We can load the DataFrame from the file hosted on my GitHub page, using the pd.read_excel() function. Then we can print out the first five records of the dataset using the .head() method.

# Loading a Sample Pandas DataFrame
import pandas as pd
df = pd.read_excel(
   'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx', 
   parse_dates=['Date']
)
print(df.head())

# Returns:
#         Date Region                 Type  Units  Sales
# 0 2020-07-11   East  Children's Clothing   18.0    306
# 1 2020-09-23  North  Children's Clothing   14.0    448
# 2 2020-04-02  South     Women's Clothing   17.0    425
# 3 2020-02-28   East  Children's Clothing   26.0    832
# 4 2020-03-19   West     Women's Clothing    3.0     33

Based on the output of the first five rows shown above, we can see that we have five columns to work with:

  • Date is a date formatted column
  • Region and Type are both strings
  • Units and Sales are numeric

Now that you have a good understanding of the DataFrame, let’s dive into how to use the Pandas query function.

Using Pandas Query to Filter a DataFrame

How can you filter a Pandas DataFrame with the query method?

The Pandas query method lets you filter a DataFrame using SQL-like, plain-English statements. The method allows you to pass in a string that filters a DataFrame to a boolean expression.

The Pandas .query() method lets you pass in a string that represents a filter expression. The syntax can feel a little awkward at first but if you’re familiar with SQL, the format will feel very natural. Let’s take a look at an example where we filter the DataFrame to show only rows where Units are less than 4.

import pandas as pd
df = pd.read_excel(
    'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx',
    parse_dates=['Date']
)

filtered = df.query('Units < 4')
print(filtered.head())

# Returns:
#           Date Region                 Type  Units  Sales
# 4   2020-03-19   West     Women's Clothing    3.0     33
# 28  2020-01-19   East       Men's Clothing    3.0     63
# 96  2020-11-13   East  Children's Clothing    3.0     72
# 118 2020-12-28   East  Children's Clothing    3.0     78
# 134 2020-09-04  North  Children's Clothing    3.0    184

We can see that by passing in a string that represents the expression, in this case 'Units < 4'. In this example, we filtered a numeric column. However, we can also filter string columns. Let’s take a look at an example:

import pandas as pd
df = pd.read_excel(
    'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx',
    parse_dates=['Date']
)

filtered = df.query('Region == "West"')
print(filtered.head())

# Returns:
#          Date Region                 Type  Units  Sales
# 4  2020-03-19   West     Women's Clothing    3.0     33
# 15 2020-11-26   West       Men's Clothing   27.0    864
# 21 2020-06-23   West     Women's Clothing   18.0    288
# 24 2020-06-18   West       Men's Clothing    5.0     70
# 30 2020-07-13   West  Children's Clothing   30.0    450

In the example above, we filtered the DataFrame to only show records where the Region is equal to West. In this case, we need to specify the string with an extra set of quotes. In this case, it’s important that we use non-matching quotation marks, so as to not accidentally close the string.

In the following section, you’ll learn how to use the Pandas query method with multiple conditions.

Using Pandas Query with Multiple Conditions

The Pandas query method can also be used to filter with multiple conditions. This allows us to specify conditions using the logical and or or operators. By using multiple conditions, we can write powerful statements that filter based on single or multiple columns.

Let’s see how we can use the method to filter data based on the Region and Units column.

import pandas as pd
df = pd.read_excel(
    'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx',
    parse_dates=['Date']
)

filtered = df.query('Region == "West" and Units < 4')
print(filtered.head())

# Returns:
#           Date Region                 Type  Units  Sales
# 4   2020-03-19   West     Women's Clothing    3.0     33
# 135 2020-01-07   West     Women's Clothing    3.0    350
# 355 2020-06-12   West  Children's Clothing    3.0    567
# 558 2020-10-06   West       Men's Clothing    3.0    462
# 686 2020-02-18   West       Men's Clothing    3.0    918

In the example above, we pass in two conditions: one based on a numeric column and another based on a string column. We use the and operator to make sure that both conditions are met.

Similarly, we can modify the expression to use the or operator to make sure that either of the conditions is met:

import pandas as pd
df = pd.read_excel(
    'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx',
    parse_dates=['Date']
)

filtered = df.query('Region == "West" or Units < 4')
print(filtered.head())

# Returns:
#          Date Region              Type  Units  Sales
# 4  2020-03-19   West  Women's Clothing    3.0     33
# 15 2020-11-26   West    Men's Clothing   27.0    864
# 21 2020-06-23   West  Women's Clothing   18.0    288
# 24 2020-06-18   West    Men's Clothing    5.0     70
# 28 2020-01-19   East    Men's Clothing    3.0     63

In the example above, we repeat our previous filter but use the or operator instead. This means that the data are filtered to records where either the Region is equal to West or the Units are less than 4.

Using Variables in Pandas Query

How can you use variables in Pandas query?

In order to use a variable in the Pandas query method you can preface the variable with an @ symbol. This will extract the value and replace it in the filter expression.

In the example below, we’ll declare a variable value and assign the value of 25. Let’s see how we can use that variable in our query expression:

import pandas as pd
df = pd.read_excel(
    'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx',
    parse_dates=['Date'])

value = 25
filtered = df.query("Units > @value")
print(filtered.head())

# Returns:
#          Date Region                 Type  Units  Sales
# 3  2020-02-28   East  Children's Clothing   26.0    832
# 5  2020-02-05  North     Women's Clothing   33.0    627
# 7  2020-03-25   East     Women's Clothing   29.0    609
# 9  2020-11-03   East  Children's Clothing   34.0    374
# 14 2020-01-07   East       Men's Clothing   30.0    360

In the example above, we write our filter expression as 'Units > @value', where the @ sign references the variable that we defined earlier.

Using Pandas Query with Not Filter

Similarly, we can use the Pandas query method to create filter expressions where a filter is not equal to a value. For this, we can use the not operator, which will inverse the returned boolean expression. In the example below, we filter to any records where the Region is not equal to West.

import pandas as pd
df = pd.read_excel(
    'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx',
    parse_dates=['Date']
)

filtered = df.query('not Region == "West"')
print(filtered.head())

# Returns:
#         Date Region                 Type  Units  Sales
# 0 2020-07-11   East  Children's Clothing   18.0    306
# 1 2020-09-23  North  Children's Clothing   14.0    448
# 2 2020-04-02  South     Women's Clothing   17.0    425
# 3 2020-02-28   East  Children's Clothing   26.0    832
# 5 2020-02-05  North     Women's Clothing   33.0    627

The filter expression above filters to any records where region is not equal to West. This is done by pretending the not operator to inverse the selection.

Using Pandas Query to Check if String Contains

We can also use the Pandas query method to check if a string contains a certain substring. This can be helpful when you want to filter a DataFrame based on a string. The syntax for this is a little less intuitive but works well in the query method. Let’s take a look at an example to filter our records to only show those where Region contains the substring 'st'.

import pandas as pd
df = pd.read_excel(
    'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx',
    parse_dates=['Date']
)

filtered = df.query("Region.str.contains('st')")
print(filtered.head())

# Returns:
#         Date Region                 Type  Units  Sales
# 0 2020-07-11   East  Children's Clothing   18.0    306
# 3 2020-02-28   East  Children's Clothing   26.0    832
# 4 2020-03-19   West     Women's Clothing    3.0     33
# 7 2020-03-25   East     Women's Clothing   29.0    609
# 9 2020-11-03   East  Children's Clothing   34.0    374

In the example above, we applied the .str.contains() method to the column. While in my view this is less clear than simply applying this to a column directly if you’re working with other query filters it can be helpful to stick to the same methods.

Using Functions in Pandas Query

One of the great features of the Pandas .query() method is that you can also filter based on values being passed into a function. Let’s see how we can use a function to filter a DataFrame using the query method.

import pandas as pd
df = pd.read_excel(
    'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx',
    parse_dates=['Date']
)

df['Sales'] = df['Sales'] * -1

filtered = df.query("abs(Sales) < 40")
print(filtered.head())

# Returns:
#           Date Region                 Type  Units  Sales
# 4   2020-03-19   West     Women's Clothing    3.0    -33
# 157 2020-01-17   East       Men's Clothing   14.0    -36
# 164 2020-03-02   East  Children's Clothing   31.0    -36
# 430 2020-02-28   East       Men's Clothing   34.0    -39
# 901 2020-04-24   East       Men's Clothing   28.0    -36

Let’s break down what the code block above is doing:

  1. We multiply the Sales column by -1 to convert it to negative values
  2. We then filter the DataFrame using the abs() function, which returns the absolute value of a value.
  3. We filter the data to only include values less than 40.

In the following section, you’ll learn how to use the method to check if a value is in a list of values.

Using Pandas Query “in” to Check a List of Values

The Pandas query method makes it very easy to search for records that contain a value from a list of values. This is similar to using the Pandas isin method which can be used to filter records that contain an item from a list of values.

Let’s see how we can use Pandas to filter records that contain an item from a list:

import pandas as pd
df = pd.read_excel(
    'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx',
    parse_dates=['Date']
)

filtered = df.query("Region in ['West', 'East']")
print(filtered.head())

# Returns:
#         Date Region                 Type  Units  Sales
# 0 2020-07-11   East  Children's Clothing   18.0    306
# 3 2020-02-28   East  Children's Clothing   26.0    832
# 4 2020-03-19   West     Women's Clothing    3.0     33
# 7 2020-03-25   East     Women's Clothing   29.0    609
# 9 2020-11-03   East  Children's Clothing   34.0    374

In the example above, we checked whether or not the Region value was in the list of values containing ['West', 'East']. This is simpler than chaining multiple conditions and is much easier to read.

Using Pandas Query to Filter Not In

Similar to the example above, we can intuitively filter data where items don’t match a given condition. This approach allows you to filter data based on plain English conditions, making it much simpler for readers of your code to understand what is happening. Let’s take a look at an example:

import pandas as pd
df = pd.read_excel(
    'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx',
    parse_dates=['Date']
)

filtered = df.query("Region not in ['West', 'East']")
print(filtered.head())

# Returns:
#         Date Region                 Type  Units  Sales
# 1 2020-09-23  North  Children's Clothing   14.0    448
# 2 2020-04-02  South     Women's Clothing   17.0    425
# 5 2020-02-05  North     Women's Clothing   33.0    627
# 6 2020-01-24  South     Women's Clothing   12.0    396
# 8 2020-01-03  North  Children's Clothing   18.0    486

In the example above we use the not in operator to filter our DataFrame based on inverse selections. This can be incredibly helpful when we don’t know which other values may be in a column but we want to filter the data based on not meeting a condition.

Using Pandas Query to Compare Two Columns

You can also use the Pandas query method to compare two columns in plain English. This allows you to use a simple statement to compare two columns, generally numeric columns. In the example below, we’ll add a random value from -3 to 3 to our Sales column. We can then filter our records to only include records where Sales is larger than our new Sales2 column

from random import randint
import pandas as pd
df = pd.read_excel(
    'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx',
    parse_dates=['Date']
)

df['Sales2'] = df['Sales'] + randint(-3, 3)
filtered = df.query("Sales > Sales2")
print(filtered.head())

# Returns:
#         Date Region                 Type  Units  Sales  Sales2
# 0 2020-07-11   East  Children's Clothing   18.0    306     304
# 1 2020-09-23  North  Children's Clothing   14.0    448     446
# 2 2020-04-02  South     Women's Clothing   17.0    425     423
# 3 2020-02-28   East  Children's Clothing   26.0    832     830
# 4 2020-03-19   West     Women's Clothing    3.0     33      31

In the example above we pass 'Sales > Sales2' into the query method, which allows us to return only records where the Sales column is larger than the other column.

Using Pandas Query to Filter Dates

We can also use the Pandas query method to easily filter a column based on dates. This allows you to use simple language to filter a column based on a date condition. You can use any of the comparison operators, such as >, <, ==, and more. Let’s take a look at how we can filter the Date column based on being after 2020-07-01:

import pandas as pd
df = pd.read_excel(
    'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx',
    parse_dates=['Date']
)

filtered = df.query("Date > '2020-07-01'")
print(filtered.head())

# Returns:
#          Date Region                 Type  Units  Sales
# 0  2020-07-11   East  Children's Clothing   18.0    306
# 1  2020-09-23  North  Children's Clothing   14.0    448
# 9  2020-11-03   East  Children's Clothing   34.0    374
# 11 2020-08-09  North       Men's Clothing    NaN    270
# 13 2020-08-11   East  Children's Clothing   12.0    348

In the example above, we filter the Date column based on following a given date. It’s important to pass the date in as a string, otherwise, the method won’t work.

Using Pandas Query to Filter the DataFrame Index

We can also use the Pandas query method to filter a DataFrame based on the index column. In this case, we pass in index as we would any other column. Let’s see how we can filter our DataFrame based on the index value:

import pandas as pd
df = pd.read_excel(
    'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx',
    parse_dates=['Date']
)

filtered = df.query("index < 5")
print(filtered.head())

# Returns:
#         Date Region                 Type  Units  Sales
# 0 2020-07-11   East  Children's Clothing   18.0    306
# 1 2020-09-23  North  Children's Clothing   14.0    448
# 2 2020-04-02  South     Women's Clothing   17.0    425
# 3 2020-02-28   East  Children's Clothing   26.0    832
# 4 2020-03-19   West     Women's Clothing    3.0     33

In the example above, we filter the DataFrame based on the value in the index. In most cases, this will make more sense when we have a more meaningful index, but the example above illustrates the point.

Using Pandas Query to Filter a DataFrame In Place

In all of the examples above, we filtered the DataFrame by assigning it to a new DataFrame, filtered. However, you can also filter the DataFrame in place. This gives you the benefit of not needing to reassign the data and can also be more memory efficient.

import pandas as pd
df = pd.read_excel(
    'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx',
    parse_dates=['Date']
)

df.query("Units < 4", inplace=True)
print(df.head())

# Returns:
#           Date Region                 Type  Units  Sales
# 4   2020-03-19   West     Women's Clothing    3.0     33
# 28  2020-01-19   East       Men's Clothing    3.0     63
# 96  2020-11-13   East  Children's Clothing    3.0     72
# 118 2020-12-28   East  Children's Clothing    3.0     78
# 134 2020-09-04  North  Children's Clothing    3.0    184

In the example above, we pass in the inplace=True argument to allow filter our data in place.

Using Columns with Spaces in Pandas Query

While the Pandas query method seems to be able to handle most operations, it can struggle a little bit with columns that have spaces in them. However, we can also use columns with spaces, though they require a bit more work. Let’s assign a new column that contains spaces in the name and see how we can filter the data.

import pandas as pd
df = pd.read_excel(
    'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx',
    parse_dates=['Date']
)

df['Column With Spaces'] = df['Units']

filtered = df.query("`Column With Spaces` < 4")
print(filtered.head())

# Returns:
#           Date Region                 Type  Units  Sales  Column With Spaces
# 4   2020-03-19   West     Women's Clothing    3.0     33                 3.0
# 28  2020-01-19   East       Men's Clothing    3.0     63                 3.0
# 96  2020-11-13   East  Children's Clothing    3.0     72                 3.0
# 118 2020-12-28   East  Children's Clothing    3.0     78                 3.0
# 134 2020-09-04  North  Children's Clothing    3.0    184                 3.0

In order to filter a DataFrame column that has spaces with the query method, we wrap the column using backticks. While this may not look great, it does allow us to use any column in the method.

Conclusion

In this guide, you learned how to use the Pandas query method to filter a DataFrame using plain English statements. The method filters a DataFrame based on an expression evaluating to a True or False. The method allows for a ton of flexibility in how you filter the data. At its core, however, the method lets you use plain English statements to filter your data.

Additional Resources

To learn more about related topics, check out the tutorials below:

Leave a Reply

Your email address will not be published. Required fields are marked *