The Pandas crosstab function is one of the many ways in which Pandas allows you to customize data. On the surface, it appears to be quite similar to the Pandas pivot table function, which I’ve covered extensively here. This post will give you a complete overview of how to best leverage the function.
The crosstab function gives you a few advantages over the pivot table function:
- The summary can be normalized to show data as either a percentage of row or column totals, and
- The data doesn’t have the be structured as a dataframe prior to analysis.
Table of Contents
Video Tutorial – Pandas Crosstab
Pandas Crosstab Function – An Overview
To begin, let’s explore the actual crosstab function:
pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name: str = 'All', dropna: bool = True, normalize=False)
Let’s take a closer look at these parameters:
Parameter | Default Value | Description |
---|---|---|
index | Values to group by in rows | |
columns | Values to group by in columns | |
values | Array of values to aggregate | |
rownames | What to name rows (list) | |
colnames | What to name columns (list) | |
aggfunc | count | How to aggregate data |
margins | False | Add row/column totals |
margins_name | “All” | The name for row/column totals |
dropna | True | Do not include columns whose entries are all NaN |
normalize | False | Normalize by dividing all values by the sum of values. Learn more in the section of normalizing. |
Pandas Crosstab vs. Pandas Pivot Table
Much of what you can accomplish with a Pandas Crosstab, you can also accomplish with a Pandas Pivot Table. The key differences are:
- The function does not require a dataframe as an input. It can also accept array-like objects for its rows and columns.
- The function can normalize the resulting dataframe, meaning that the values displayed can be displayed as percentage of row or column totals.
- The default function is len (count), whereas the pivot table one is numpy’s mean function.
Loading and Exploring our Sample Data Set
Let’s start off by loading our data set. We’ve built a free downloadable data set that can be found at this link. We’ll use Pandas to import the data into a dataframe called df. We’ll also print out the first five rows using the .head() function:
import pandas as pd
df = pd.read_excel('https://github.com/datagy/mediumdata/raw/master/sample_pivot.xlsx', parse_dates=['Date'])
print(df.head())
This returns:
Out:
Date Region Type Units Sales
0 2020-07-11 East Children's Clothing 18 306
1 2020-09-23 North Children's Clothing 14 448
2 2020-04-02 South Women's Clothing 17 425
3 2020-02-28 East Children's Clothing 26 832
4 2020-03-19 West Women's Clothing 3 33
Let’s take a look at the columns available:
Column Name | Description |
---|---|
Date | Date of transaction |
Region | The region of the transaction |
Type | The type of clothing sold |
Units | The number of units sold |
Sales | The cost of the sale |
Creating a Crosstab in Pandas
Let’s create our first crosstab in Pandas:
pd.crosstab(df.Region, df.Type)
# This is the same as:
# pd.crosstab(index = df.Region, columns = df.Type)
This returns:
Type Children's Clothing Men's Clothing Women's Clothing
Region
East 113 122 176
North 85 89 142
South 45 39 53
West 42 41 53
By default, Pandas will generate a crosstab which counts the number of times each item appears (the length of that series).
Check out some other Python tutorials on datagy, including our complete guide to styling Pandas and our comprehensive overview of Pivot Tables in Pandas!
Changing Pandas Crosstab Aggregation
If you wanted to change the type of aggregation used, you can apply the aggfunc parameter. To use the aggfunc parameter requires the values parameter to also be passed. If you wanted to get the mean of each sale, you could write:
pd.crosstab(index = df.Region, columns = df.Type, values = df.Sales, aggfunc = 'mean')
This returns:
Type Children's Clothing Men's Clothing Women's Clothing
Region
East 405.743363 423.647541 399.028409
North 438.894118 449.157303 432.528169
South 412.666667 475.435897 418.924528
West 480.523810 465.292683 419.188679
Adding Row and Column Labels
Pandas Crosstabs also allow you to add column or row labels. The rownames and colnames parameters control these, and accept lists. If you’ve added multiple rows or columns, the length of the list must match the length of the rows/columns being added.
Let’s change the names of both the rows and columns:
pd.crosstab(index = df.Region, columns = df.Type, values = df.Sales, aggfunc = 'mean', rownames=['Region Name'], colnames=['Clothing Type'])
This returns:
Clothing Type Children's Clothing Men's Clothing Women's Clothing
Region Name
East 405.743363 423.647541 399.028409
North 438.894118 449.157303 432.528169
South 412.666667 475.435897 418.924528
West 480.523810 465.292683 419.188679
Creating a Multi-Index Pandas Crosstab
You can add multiple indices (rows) to a crosstab as well. This can be done by passing a list of variables to the crosstab function.
Say you wanted to break items down by region and quarter, you can pass these into the index parameter. We’ll use Pandas’ dt accessor to create a quarter variable on the fly:
pd.crosstab([df.Region, df.Date.dt.quarter], df.Type)
This returns:
Type Children's Clothing Men's Clothing Women's Clothing
Region Date
East 1 29 36 39
2 20 40 49
3 34 19 44
4 30 27 44
North 1 11 23 36
2 24 24 38
3 31 19 40
4 19 23 28
South 1 10 14 14
2 11 8 16
3 11 8 16
4 13 9 7
West 1 15 13 12
2 9 7 11
3 7 7 12
4 11 14 18
As explained earlier, you can change the names of the labels. Let’s change them to better reflect that the data represents quarters:
pd.crosstab([df.Region, df.Date.dt.quarter], df.Type, rownames=['Region', 'Quarter'])
This returns:
Type Children's Clothing Men's Clothing Women's Clothing
Region Quarter
East 1 29 36 39
2 20 40 49
3 34 19 44
4 30 27 44
North 1 11 23 36
2 24 24 38
3 31 19 40
4 19 23 28
South 1 10 14 14
2 11 8 16
3 11 8 16
4 13 9 7
West 1 15 13 12
2 9 7 11
3 7 7 12
4 11 14 18
Adding Columns to a Pandas Crosstab
Similar to adding multiple rows, you can also add multiple columns. Let’s repeat the example above and break the data out by Type and by Quarter as columns:
pd.crosstab(df.Region, [df.Type, df.Date.dt.quarter])
This returns:
Type Children's Clothing Men's Clothing Women's Clothing
Date 1 2 3 4 1 2 3 4 1 2 3 4
Region
East 29 20 34 30 36 40 19 27 39 49 44 44
North 11 24 31 19 23 24 19 23 36 38 40 28
South 10 11 11 13 14 8 8 9 14 16 16 7
West 15 9 7 11 13 7 7 14 12 11 12 18
Similarly, you can change the row names by using the colnames argument. Let’s make them a little more appropriate:
pd.crosstab(df.Region, [df.Type, df.Date.dt.quarter], colnames=['Type', 'Quarter'])
This returns:
Type Children's Clothing Men's Clothing Women's Clothing
Quarter 1 2 3 4 1 2 3 4 1 2 3 4
Region
East 29 20 34 30 36 40 19 27 39 49 44 44
North 11 24 31 19 23 24 19 23 36 38 40 28
South 10 11 11 13 14 8 8 9 14 16 16 7
West 15 9 7 11 13 7 7 14 12 11 12 18
Normalize a Pandas Crosstab for Row/Column Percentages
A key benefit of the crosstab function over the Pandas Pivot Table function is that it allows you to normalize the resulting dataframe, returning values displayed as percentages.
This goes one step further – the normalize argument accepts a number of different options:
- ‘all’ or True – normalizes the values across the entire dataframe (as a percentage of the total across rows and columns)
- ‘index’ – normalizes across rows
- ‘columns’ – normalizes down columns
If the margins argument is set to True, the totals will also be normalized.
Let’s try this out – let’s first use the index value to normalize our rows:
pd.crosstab(df.Region, df.Type, normalize='index')
This returns:
Type Children's Clothing Men's Clothing Women's Clothing
Region
East 0.274939 0.296837 0.428224
North 0.268987 0.281646 0.449367
South 0.328467 0.284672 0.386861
West 0.308824 0.301471 0.389706
The sum of each row is 1.
Now, let’s try to normalize across everything:
pd.crosstab(df.Region, df.Type, normalize='all')
This returns:
Type Children's Clothing Men's Clothing Women's Clothing
Region
East 0.113 0.122 0.176
North 0.085 0.089 0.142
South 0.045 0.039 0.053
West 0.042 0.041 0.053
With this instance, the values are divided by the sum of all values across the dataframe (i.e., percentage of table total).
Plotting Pandas Crosstabs
Since Pandas crosstabs are dataframes, you can use Pandas plotting functionality to easily plot out data. Let’s give this a shot, using the normalized crosstabs we made above:
pd.crosstab(df.Region, df.Type, normalize='index').plot.bar(stacked=True)
This returns:
Adding Totals to Pandas Crosstabs (Margins)
It’s possible to add row and column totals to your crosstab easily. This is done using the margins argument. To turn on totals, add the True value to the parameter:
pd.crosstab(df.Region, df.Type, margins=True)
This returns:
Type Children's Clothing Men's Clothing Women's Clothing All
Region
East 113 122 176 411
North 85 89 142 316
South 45 39 53 137
West 42 41 53 136
All 285 291 424 1000
You can change the name of the column and row, you can use the margins_name argument. Let’s change ‘All’ to ‘Total’:
pd.crosstab(df.Region, df.Type, margins=True, margins_name = 'Total')
This returns:
Type Children's Clothing Men's Clothing Women's Clothing Total
Region
East 113 122 176 411
North 85 89 142 316
South 45 39 53 137
West 42 41 53 136
Total 285 291 424 1000
If you’ve normalized your crosstab, the margins will be included:
pd.crosstab(df.Region, df.Type, margins=True, margins_name='Total', normalize='index')
Which returns:
Type Children's Clothing Men's Clothing Women's Clothing
Region
East 0.274939 0.296837 0.428224
North 0.268987 0.281646 0.449367
South 0.328467 0.284672 0.386861
West 0.308824 0.301471 0.389706
Total 0.285000 0.291000 0.424000
Conclusion
In this post, you learned all you need to know about the Pandas crosstab function, including what the different parameters mean, how to add multiple indices and columns, how to normalize your crosstab, how to add totals, and how to plot your resulting crosstab.
To learn more about the crosstab function, check out the official documentation.
is it possible te put 2 values in a column under 2 levels in a crosstab.
example
2022
M F
age income age income
25 1000 35 1500
Thanks, great tutorial, exactly what I was looking for!
Thanks Kyle!