Binning Data in Python with Pandas’ cut()

  • by
Binning data with pandas - exploring the cut method
  • Save

In this post, we’ll explore how binning data in Python works with the cut() method in Pandas. In the past, we’ve explored how to use the describe() method to generate some descriptive statistics. In particular, the describe method allows us to see the quarter percentiles of a numerical column.

However, as we’re generating insight into our data, we may want to assign custom groupings. Let’s explore when we may want to bin data, and how we can do that in Pandas with the cut method.

Loading Sample Data

Let’s load a sample data set that contains two variables: age and gender.

For a more step-by-step introduction to what we’re doing below in loading the data, check out our other post here.

import pandas as pd
dict = {
    'age' : [7, 28, 19, 19, 37, 31, 32, 25, 10, 40, 17, 38, 51, 10, 36, 1, 38, 33, 4, 48, 23, 44, 6, 44, 19, 21, 53, 39, 18, 18, 4, 5, 7, 2, 49, 57, 13, 4, 3, 46, 20, 31, 11, 59, 58, 22, 31, 6, 44, 18, 40, 42, 46, 34, 23, 31, 57, 31, 57, 24, 33, 44, 24, 11, 37, 3, 16, 45, 26, 34, 4, 47, 42, 15, 45, 44, 35, 56, 40, 54, 9, 23, 15, 31, 1, 59, 15, 35, 52, 12, 34, 16, 59, 20, 6, 37, 26, 15, 2, 20, 2, 18, 30, 22, 50, 39, 58, 49, 23, 10, 44, 3, 30, 17, 18, 2, 41, 4, 31, 50, 1, 16, 26, 19, 11, 58, 15, 9, 31, 51, 3, 53, 12, 9, 49, 22, 55, 41, 14, 2, 41, 23, 13, 1, 34, 2, 32, 52, 1, 20, 48, 35, 6, 5, 35, 58, 20, 19, 17, 9, 20, 40, 52, 29, 40, 22, 51, 23, 2, 38, 35, 30, 29, 13, 55, 18, 29, 37, 25, 4, 43, 35, 24, 48, 8, 18, 40, 1, 43, 21, 33, 52, 36, 26, 26, 43, 35, 10, 44, 36, 47, 57, 38, 15, 22, 54, 36, 2, 44, 25, 55, 38, 5, 26, 47, 39, 32, 12, 37, 34, 4, 13, 9, 23, 34, 26, 17, 5, 55, 12, 42, 35, 27, 1, 31, 26, 10, 52, 36, 43, 12, 18, 57, 50, 52, 13, 59, 28, 21, 55, 28, 17, 33, 6, 30, 51, 18, 35, 44, 23, 26, 29, 50, 20, 7, 5, 47, 38, 9, 35, 16, 38, 26, 53, 30, 41, 1, 41, 56, 26, 37, 45, 16, 48, 30, 20, 36, 28, 4, 19, 34, 56, 51, 18, 48, 11, 17, 35, 1, 32, 38, 38, 20, 37, 59, 55, 29, 36, 5, 25, 1, 11, 34, 32, 33, 22, 47, 36, 59, 33, 24, 38, 11, 58, 15, 48, 5, 2, 16, 59, 49, 50, 23, 53, 21, 45, 48, 25, 49, 9, 20, 23, 10, 58, 17, 40, 25, 11, 12, 45],
    'gender' : ['F', 'F', 'M', 'M', 'F', 'F', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'M', 'F', 'F', 'M', 'F', 'F', 'M', 'F', 'M', 'F', 'F', 'F', 'M', 'F', 'M', 'F', 'F', 'F', 'M', 'F', 'M', 'F', 'F', 'M', 'M', 'M', 'F', 'M', 'M', 'M', 'M', 'F', 'M', 'F', 'F', 'F', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'F', 'F', 'M', 'F', 'F', 'M', 'M', 'M', 'F', 'M', 'M', 'F', 'F', 'M', 'F', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'F', 'M', 'M', 'F', 'M', 'M', 'F', 'M', 'M', 'M', 'M', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'F', 'M', 'F', 'F', 'M', 'F', 'F', 'F', 'F', 'M', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'F', 'F', 'F', 'F', 'M', 'M', 'F', 'M', 'M', 'M', 'F', 'F', 'M', 'F', 'F', 'F', 'M', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'F', 'F', 'F', 'F', 'F', 'M', 'M', 'M', 'M', 'F', 'M', 'M', 'M', 'F', 'M', 'M', 'M', 'M', 'M', 'M', 'F', 'M', 'M', 'M', 'M', 'F', 'M', 'F', 'F', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'M', 'M', 'M', 'M', 'F', 'M', 'F', 'F', 'F', 'M', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'M', 'F', 'M', 'F', 'M', 'M', 'F', 'M', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'M', 'F', 'M', 'F', 'M', 'M', 'M', 'F', 'M', 'M', 'M', 'M', 'F', 'M', 'F', 'F', 'F', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'F', 'M', 'M', 'M', 'F', 'F', 'M', 'M', 'M', 'M', 'F', 'M', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'M', 'M', 'M', 'F', 'M', 'F', 'M', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'M', 'M', 'F', 'M', 'M', 'M', 'F', 'F', 'F', 'F', 'F', 'M', 'M', 'F', 'F', 'F', 'F', 'F', 'M', 'M', 'F', 'M', 'F', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'M', 'M', 'F', 'F']
} 
df = pd.DataFrame(dict)

What a Normal Pivot Table Result Looks Like

We learned in a previous post how to pivot the data . An issue we’ll run into here, is that the resulting pivot table will still contain a lot of data.

pd.pivot_table(data = df, index = 'age', aggfunc = 'count').head()
age gender
110
29
34
48
57

This is helpful if we want this level of detail. However, what if we wanted to learn about the count of people in each decade of age (10s, 20s, etc.).

Binning Data with Python

We could apply a complicated function via map and merge or we could make use of the .cut() method built into Pandas. This is the recommended process for binning data with Python!

The beautiful thing about the cut method is that we can define the boundaries of our bin edges ourselves. Unlike using the describe method, we’re not constrained to quarters.

Let’s begin by exploring the syntax of the method. The full documentation can be found here.

The general, full syntax looks like this:

pd.cut(x, bins, right: bool = True, labels=None, retbins:  bool = False, precision: int = 3, include_lowest: bool = False,  duplicates: str = 'raise')

For the purposes of this tutorial, we’ll focus only on a reduced version:

pd.cut(x, bins, labels = None)

Before we begin, it’ll be important for us to understand what the maximum and minimum values are in the age column.

print("Max age: ", df['age'].max())
print("Max age: ", df['age'].min())
Max age:  59
Min age:  1

We now know that we have ages that range from 1-59.

df['bin'] = pd.cut(df['age'], bins = [0,10,20,30,40,50,60])
df.head()
agegenderbin
07F(0, 10]
128F(20, 30]
219M(10, 20]
319M(10, 20]
437F(30, 40]

In this first example, we dropped the labels argument. What was returned were bin sizes that were numerically defined. What the formatting of each bin tells us is as below:

  • ( or ) indicates that a number isn’t included
  • [ or ] indicates that a number is included

For example, (0, 10] tells us that the bin includes values from, but not including 0, up to (and including) 10.

This, however, is still quite ugly. What if we want to include the value labels as decades. We can change our function to the following:

df['bin'] = pd.cut(df['age'], bins = [0,10,20,30,40,50,60], labels = ['0s', '10s', '20s', '30s', '40s', '50s'])
df.head()
agegenderbin
07F0s
128F20s
219M10s
319M10s
437F30s

Now, when we generate a pivot table, the data will be more manageable and easier to understand:

pd.pivot_table(data = df, index = 'bin', values = 'age', aggfunc = 'count')
age
bin
0s60
10s65
20s57
30s74
40s50
50s44

Conclusion

In this post, we learned how binning data in Python works with Pandas using the cut method. The cut method allows us to easily group data and apply helpful labels to each bin.