Pandas Crosstab - The Complete Guide (w/ Examples) • datagy

The Pandas crosstab function is one of the many ways in which Pandas allows you to customize data. On the surface, it appears to be quite similar to the Pandas pivot table function, which I’ve covered extensively here. This post will give you a complete overview of how to best leverage the function.

The crosstab function gives you a few advantages over the pivot table function:

The summary can be normalized to show data as either a percentage of row or column totals, and
The data doesn’t have the be structured as a dataframe prior to analysis.

Table of Contents

Video Tutorial – Pandas Crosstab

Pandas Crosstab Function – An Overview

To begin, let’s explore the actual crosstab function:

pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name: str = 'All', dropna: bool = True, normalize=False)

Let’s take a closer look at these parameters:

Parameter	Default Value	Description
index		Values to group by in rows
columns		Values to group by in columns
values		Array of values to aggregate
rownames		What to name rows (list)
colnames		What to name columns (list)
aggfunc	count	How to aggregate data
margins	False	Add row/column totals
margins_name	“All”	The name for row/column totals
dropna	True	Do not include columns whose entries are all NaN
normalize	False	Normalize by dividing all values by the sum of values. Learn more in the section of normalizing.

Pandas Crosstab vs. Pandas Pivot Table

Much of what you can accomplish with a Pandas Crosstab, you can also accomplish with a Pandas Pivot Table. The key differences are:

The function does not require a dataframe as an input. It can also accept array-like objects for its rows and columns.
The function can normalize the resulting dataframe, meaning that the values displayed can be displayed as percentage of row or column totals.
The default function is len (count), whereas the pivot table one is numpy’s mean function.

Loading and Exploring our Sample Data Set

Let’s start off by loading our data set. We’ve built a free downloadable data set that can be found at this link. We’ll use Pandas to import the data into a dataframe called df. We’ll also print out the first five rows using the .head() function:

import pandas as pd

df = pd.read_excel('https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx', parse_dates=['Date'])

print(df.head())

This returns:

Out:
        Date 	        Region 	Type 	                Units 	Sales
0 	2020-07-11 	East 	Children's Clothing 	18 	306
1 	2020-09-23 	North 	Children's Clothing 	14 	448
2 	2020-04-02 	South 	Women's Clothing 	17 	425
3 	2020-02-28 	East 	Children's Clothing 	26 	832
4 	2020-03-19 	West 	Women's Clothing 	3 	33

Let’s take a look at the columns available:

Column Name	Description
Date	Date of transaction
Region	The region of the transaction
Type	The type of clothing sold
Units	The number of units sold
Sales	The cost of the sale

Creating a Crosstab in Pandas

Let’s create our first crosstab in Pandas:

pd.crosstab(df.Region, df.Type)

# This is the same as:
# pd.crosstab(index = df.Region, columns = df.Type)

This returns:

Type    Children's Clothing  Men's Clothing  Women's Clothing
Region
East                    113             122               176
North                    85              89               142
South                    45              39                53
West                     42              41                53

By default, Pandas will generate a crosstab which counts the number of times each item appears (the length of that series).

Check out some other Python tutorials on datagy, including our complete guide to styling Pandas and our comprehensive overview of Pivot Tables in Pandas!

Changing Pandas Crosstab Aggregation

If you wanted to change the type of aggregation used, you can apply the aggfunc parameter. To use the aggfunc parameter requires the values parameter to also be passed. If you wanted to get the mean of each sale, you could write:

pd.crosstab(index = df.Region, columns = df.Type, values = df.Sales, aggfunc = 'mean')

This returns:

Type    Children's Clothing  Men's Clothing  Women's Clothing
Region
East             405.743363      423.647541        399.028409
North            438.894118      449.157303        432.528169
South            412.666667      475.435897        418.924528
West             480.523810      465.292683        419.188679

Adding Row and Column Labels

Pandas Crosstabs also allow you to add column or row labels. The rownames and colnames parameters control these, and accept lists. If you’ve added multiple rows or columns, the length of the list must match the length of the rows/columns being added.

Let’s change the names of both the rows and columns:

pd.crosstab(index = df.Region, columns = df.Type, values = df.Sales, aggfunc = 'mean', rownames=['Region Name'], colnames=['Clothing Type'])

This returns:

Clothing Type  Children's Clothing  Men's Clothing  Women's Clothing
Region Name
East                    405.743363      423.647541        399.028409
North                   438.894118      449.157303        432.528169
South                   412.666667      475.435897        418.924528
West                    480.523810      465.292683        419.188679

Creating a Multi-Index Pandas Crosstab

You can add multiple indices (rows) to a crosstab as well. This can be done by passing a list of variables to the crosstab function.

Say you wanted to break items down by region and quarter, you can pass these into the index parameter. We’ll use Pandas’ dt accessor to create a quarter variable on the fly:

pd.crosstab([df.Region, df.Date.dt.quarter], df.Type)

This returns:

Type         Children's Clothing  Men's Clothing  Women's Clothing
Region Date
East   1                      29              36                39
       2                      20              40                49
       3                      34              19                44
       4                      30              27                44
North  1                      11              23                36
       2                      24              24                38
       3                      31              19                40
       4                      19              23                28
South  1                      10              14                14
       2                      11               8                16
       3                      11               8                16
       4                      13               9                 7
West   1                      15              13                12
       2                       9               7                11
       3                       7               7                12
       4                      11              14                18

As explained earlier, you can change the names of the labels. Let’s change them to better reflect that the data represents quarters:

pd.crosstab([df.Region, df.Date.dt.quarter], df.Type, rownames=['Region', 'Quarter'])

This returns:

Type            Children's Clothing  Men's Clothing  Women's Clothing
Region Quarter
East   1                         29              36                39
       2                         20              40                49
       3                         34              19                44
       4                         30              27                44
North  1                         11              23                36
       2                         24              24                38
       3                         31              19                40
       4                         19              23                28
South  1                         10              14                14
       2                         11               8                16
       3                         11               8                16
       4                         13               9                 7
West   1                         15              13                12
       2                          9               7                11
       3                          7               7                12
       4                         11              14                18

Adding Columns to a Pandas Crosstab

Similar to adding multiple rows, you can also add multiple columns. Let’s repeat the example above and break the data out by Type and by Quarter as columns:

pd.crosstab(df.Region, [df.Type, df.Date.dt.quarter])

This returns:

Type   Children's Clothing             Men's Clothing             Women's Clothing
Date                     1   2   3   4              1   2   3   4                1   2   3   4
Region
East                    29  20  34  30             36  40  19  27               39  49  44  44
North                   11  24  31  19             23  24  19  23               36  38  40  28
South                   10  11  11  13             14   8   8   9               14  16  16   7
West                    15   9   7  11             13   7   7  14               12  11  12  18

Similarly, you can change the row names by using the colnames argument. Let’s make them a little more appropriate:

pd.crosstab(df.Region, [df.Type, df.Date.dt.quarter], colnames=['Type', 'Quarter'])

This returns:

Type    Children's Clothing             Men's Clothing             Women's Clothing
Quarter                   1   2   3   4              1   2   3   4                1   2   3   4
Region
East                     29  20  34  30             36  40  19  27               39  49  44  44
North                    11  24  31  19             23  24  19  23               36  38  40  28
South                    10  11  11  13             14   8   8   9               14  16  16   7
West                     15   9   7  11             13   7   7  14               12  11  12  18

Normalize a Pandas Crosstab for Row/Column Percentages

A key benefit of the crosstab function over the Pandas Pivot Table function is that it allows you to normalize the resulting dataframe, returning values displayed as percentages.

This goes one step further – the normalize argument accepts a number of different options:

‘all’ or True – normalizes the values across the entire dataframe (as a percentage of the total across rows and columns)
‘index’ – normalizes across rows
‘columns’ – normalizes down columns

If the margins argument is set to True, the totals will also be normalized.

Let’s try this out – let’s first use the index value to normalize our rows:

pd.crosstab(df.Region, df.Type, normalize='index')

This returns:

Type    Children's Clothing  Men's Clothing  Women's Clothing
Region
East               0.274939        0.296837          0.428224
North              0.268987        0.281646          0.449367
South              0.328467        0.284672          0.386861
West               0.308824        0.301471          0.389706

The sum of each row is 1.

Now, let’s try to normalize across everything:

pd.crosstab(df.Region, df.Type, normalize='all')

This returns:

Type    Children's Clothing  Men's Clothing  Women's Clothing
Region
East                  0.113           0.122             0.176
North                 0.085           0.089             0.142
South                 0.045           0.039             0.053
West                  0.042           0.041             0.053

With this instance, the values are divided by the sum of all values across the dataframe (i.e., percentage of table total).

Plotting Pandas Crosstabs

Since Pandas crosstabs are dataframes, you can use Pandas plotting functionality to easily plot out data. Let’s give this a shot, using the normalized crosstabs we made above:

pd.crosstab(df.Region, df.Type, normalize='index').plot.bar(stacked=True)

This returns:

Adding Totals to Pandas Crosstabs (Margins)

It’s possible to add row and column totals to your crosstab easily. This is done using the margins argument. To turn on totals, add the True value to the parameter:

pd.crosstab(df.Region, df.Type, margins=True)

This returns:

Type    Children's Clothing  Men's Clothing  Women's Clothing   All
Region
East                    113             122               176   411
North                    85              89               142   316
South                    45              39                53   137
West                     42              41                53   136
All                     285             291               424  1000

You can change the name of the column and row, you can use the margins_name argument. Let’s change ‘All’ to ‘Total’:

pd.crosstab(df.Region, df.Type, margins=True, margins_name = 'Total')

This returns:

Type    Children's Clothing  Men's Clothing  Women's Clothing  Total
Region
East                    113             122               176    411
North                    85              89               142    316
South                    45              39                53    137
West                     42              41                53    136
Total                   285             291               424   1000

If you’ve normalized your crosstab, the margins will be included:

pd.crosstab(df.Region, df.Type, margins=True, margins_name='Total', normalize='index')

Which returns:

Type    Children's Clothing  Men's Clothing  Women's Clothing
Region
East               0.274939        0.296837          0.428224
North              0.268987        0.281646          0.449367
South              0.328467        0.284672          0.386861
West               0.308824        0.301471          0.389706
Total              0.285000        0.291000          0.424000

Conclusion

In this post, you learned all you need to know about the Pandas crosstab function, including what the different parameters mean, how to add multiple indices and columns, how to normalize your crosstab, how to add totals, and how to plot your resulting crosstab.

To learn more about the crosstab function, check out the official documentation.

3 thoughts on “Pandas Crosstab – Everything You Need to Know”

Tonnie June 23, 2022 at 5:39 am


is it possible te put 2 values in a column under 2 levels in a crosstab.
example
2022
M F
age income age income
25 1000 35 1500
Kyle August 10, 2022 at 8:48 pm


Thanks, great tutorial, exactly what I was looking for!
1. Nik August 11, 2022 at 6:37 am
  
  
  Thanks Kyle!

Pandas Crosstab – Everything You Need to Know

Video Tutorial – Pandas Crosstab

Pandas Crosstab Function – An Overview

Pandas Crosstab vs. Pandas Pivot Table

Loading and Exploring our Sample Data Set

Creating a Crosstab in Pandas

Changing Pandas Crosstab Aggregation

Adding Row and Column Labels

Creating a Multi-Index Pandas Crosstab

Adding Columns to a Pandas Crosstab

Normalize a Pandas Crosstab for Row/Column Percentages

Plotting Pandas Crosstabs

Adding Totals to Pandas Crosstabs (Margins)

Conclusion

Nik Piepenbreier

3 thoughts on “Pandas Crosstab – Everything You Need to Know”

Leave a Reply Cancel reply

Pandas Crosstab – Everything You Need to Know

Video Tutorial – Pandas Crosstab

Pandas Crosstab Function – An Overview

Pandas Crosstab vs. Pandas Pivot Table

Loading and Exploring our Sample Data Set

Creating a Crosstab in Pandas

Changing Pandas Crosstab Aggregation

Adding Row and Column Labels

Creating a Multi-Index Pandas Crosstab

Adding Columns to a Pandas Crosstab

Normalize a Pandas Crosstab for Row/Column Percentages

Plotting Pandas Crosstabs

Adding Totals to Pandas Crosstabs (Margins)

Conclusion

Nik Piepenbreier

3 thoughts on “Pandas Crosstab – Everything You Need to Know”

Leave a Reply Cancel reply

Thank you!