Skip to content

Pandas Groupby and Aggregate for Multiple Columns

Pandas Groupby and Aggregate for Multiple Columns Cover Image

The Pandas groupby method is incredibly powerful and even lets you group by and aggregate multiple columns. In this tutorial, you’ll learn how to use the Pandas groupby method to aggregate multiple columns. The syntax of the method can be a little confusing at first. Don’t worry – this tutorial will simplify this. If you’re looking to get a deep dive on the Pandas groupby method, we’ve got you covered, too.

By the end of this tutorial, you’ll have learned:

  • How to use Pandas groupby for multiple columns
  • How to use multiple aggregations for multiple columns in Pandas groupby
  • How to use different aggregations for multiple columns in groupby

How to use Pandas Groupby with Multiple Columns

How can you use the Pandas groupby method with multiple columns?

To use Pandas groupby with multiple columns, you can pass in a list of column headers directly into the method. The order in which you pass columns into the list determines the hierarchy of columns you use.

To start, let’s load a sample Pandas DataFrame. We’ll use the same dataset as we did in our in-depth guide to Pandas pivot tables.

import pandas as pd
df = pd.read_excel(
    'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx')

print(df.head())

#         Date Region                 Type  Units  Sales
# 0 2020-07-11   East  Children's Clothing   18.0    306
# 1 2020-09-23  North  Children's Clothing   14.0    448
# 2 2020-04-02  South     Women's Clothing   17.0    425
# 3 2020-02-28   East  Children's Clothing   26.0    832
# 4 2020-03-19   West     Women's Clothing    3.0     33

We can see that we have five columns, two of which are numeric, two are strings, and one is a date. Now, let’s take a look at a simple example and break down what’s happening:

# Grouping by Multiple Columns in Pandas
import pandas as pd
df = pd.read_excel(
    'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx')

grouped = df.groupby(['Region', 'Type']).sum()
print(grouped.head())

# Returns:
#                              Units  Sales
# Region Type                              
# East   Children's Clothing  2318.0  45849
#        Men's Clothing       2420.0  51685
#        Women's Clothing     3372.0  70229
# North  Children's Clothing  1763.0  37306
#        Men's Clothing          0.0  39975

Let’s break down what is happening in the code block above:

  1. We import Pandas and load our dataset
  2. We then create a new grouped DataFrame by passing in ['Region', 'Type'] into the .groupby() method
  3. Finally, we apply the .sum() method to calculate the sum for each aggregation

We can see that by passing in a list of multiple columns, we create a hierarchy in which columns are to be grouped. This is very important and determines the layers in which your data will be grouped.

Using GroupBy with Multiple Columns to Aggregate a Single Columns

In the example above, we used the Pandas .groupby() method to aggregate multiple columns. However, we aggregated all of the numeric columns. To use Pandas groupby with multiple columns and aggregate only a single column, we need to index that column.

Let’s see how we can do this and then break down how it works:

# Using GroupBy with Multiple Columns to Aggregate a Single Columns
import pandas as pd
df = pd.read_excel(
    'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx')

grouped = df.groupby(['Region', 'Type'])['Units'].sum()
print(grouped.head())

# Returns:
# Region  Type               
# East    Children's Clothing    2318.0
#         Men's Clothing         2420.0
#         Women's Clothing       3372.0
# North   Children's Clothing    1763.0
#         Men's Clothing            0.0

In the example above, we are grouping by multiple columns, but only aggregating the 'Units' column.

How to Use Multiple Aggregations with Pandas GroupBy with Multiple Columns

Similar to the example above, we can also use multiple aggregations when using the groupby method with multiple columns. In order to do this, we can pass a list into the .aggregate() method that defines the aggregations we want to use. Let’s see how we can calculate the minimum and maximum values for the Units columns:

# Multiple Aggregations with Pandas GroupBy with Multiple Columns
import pandas as pd
df = pd.read_excel(
    'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx')

grouped = df.groupby(['Region', 'Type'])['Units'].aggregate(['min', 'max'])
print(grouped.head())

# Returns:
#                             min   max
# Region Type                          
# East   Children's Clothing  3.0  35.0
#        Men's Clothing       3.0  35.0
#        Women's Clothing     3.0  35.0
# North  Children's Clothing  3.0  35.0
#        Men's Clothing       NaN   NaN

In order to use multiple aggregations, we have to use the .aggregate() method on our groupby object. This allows you to calculate multiple aggregations for the grouping you’re using. Finally, let’s take a look at how to use different aggregations per column when using Pandas groupby with multiple columns.

How to Use Different Aggregations Per Column When Using Pandas GroupBy with Multiple Columns

Pandas also allows you to use different aggregations per column when using groupby with multiple columns. In the example above, we used a list to pass multiple strings into the .aggregate() function. Similarly, we can pass in a dictionary that defined the aggregations we want to use per column.

Let’s take a look at an example and then dive into how this works:

# Different Aggregations Per Column When Using Pandas GroupBy with Multiple Columns
import pandas as pd
df = pd.read_excel(
    'https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx')

grouped = df.groupby(['Region', 'Type']).aggregate(
    {'Units': 'mean', 'Sales': ['min', 'max']})
print(grouped.head())

# Returns:
#                                 Units Sales      
#                                  mean   min   max
# Region Type                                      
# East   Children's Clothing  20.513274    36  1122
#        Men's Clothing       19.836066    36  1054
#        Women's Clothing     19.159091    52  1054
# North  Children's Clothing  20.741176    55  1088
#        Men's Clothing             NaN    66  1155

Let’s break down what we’re doing in the code above:

  1. We create our groupby object as before, grouping by the Region and Type fields
  2. We then apply the .aggregate() method to this groupby object
  3. In the .aggregate() method, we pass in a dictionary. The keys are the column labels and the values represent the aggregations we want to apply.

We can see that we can pass in a single aggregation, as we did for Units, or pass in a list of aggregations, as we did for Sales.

Conclusion

In this tutorial, you learned how to use Pandas groupby with multiple columns. The groupby method is an incredibly powerful and versatile method that allows you to aggregate values in a similar way to SQL GROUP BY statements.

You first learned how to use the .groupby() method with multiple columns. Then, you learned how to aggregate only a single column, when grouping by multiple columns. Then, you learned how to specify multiple aggregations for all columns. Finally, you learned how to specify different aggregations for each column when grouping by multiple columns.

Additional Resources

To learn more about related topics, check out the tutorials below:

Leave a Reply

Your email address will not be published. Required fields are marked *