In this post, we’ll explore how binning data in Python works with the cut() method in Pandas. In the past, we’ve explored how to use the describe() method to generate some descriptive statistics. In particular, the describe method allows us to see the quarter percentiles of a numerical column.
However, as we’re generating insight into our data, we may want to assign custom groupings. Let’s explore when we may want to bin data, and how we can do that in Pandas with the cut method.
Loading Sample Data
Let’s load a sample data set that contains two variables: age and gender.
For a more step-by-step introduction to what we’re doing below in loading the data, check out our other post here.
import pandas as pd dict = { 'age' : [7, 28, 19, 19, 37, 31, 32, 25, 10, 40, 17, 38, 51, 10, 36, 1, 38, 33, 4, 48, 23, 44, 6, 44, 19, 21, 53, 39, 18, 18, 4, 5, 7, 2, 49, 57, 13, 4, 3, 46, 20, 31, 11, 59, 58, 22, 31, 6, 44, 18, 40, 42, 46, 34, 23, 31, 57, 31, 57, 24, 33, 44, 24, 11, 37, 3, 16, 45, 26, 34, 4, 47, 42, 15, 45, 44, 35, 56, 40, 54, 9, 23, 15, 31, 1, 59, 15, 35, 52, 12, 34, 16, 59, 20, 6, 37, 26, 15, 2, 20, 2, 18, 30, 22, 50, 39, 58, 49, 23, 10, 44, 3, 30, 17, 18, 2, 41, 4, 31, 50, 1, 16, 26, 19, 11, 58, 15, 9, 31, 51, 3, 53, 12, 9, 49, 22, 55, 41, 14, 2, 41, 23, 13, 1, 34, 2, 32, 52, 1, 20, 48, 35, 6, 5, 35, 58, 20, 19, 17, 9, 20, 40, 52, 29, 40, 22, 51, 23, 2, 38, 35, 30, 29, 13, 55, 18, 29, 37, 25, 4, 43, 35, 24, 48, 8, 18, 40, 1, 43, 21, 33, 52, 36, 26, 26, 43, 35, 10, 44, 36, 47, 57, 38, 15, 22, 54, 36, 2, 44, 25, 55, 38, 5, 26, 47, 39, 32, 12, 37, 34, 4, 13, 9, 23, 34, 26, 17, 5, 55, 12, 42, 35, 27, 1, 31, 26, 10, 52, 36, 43, 12, 18, 57, 50, 52, 13, 59, 28, 21, 55, 28, 17, 33, 6, 30, 51, 18, 35, 44, 23, 26, 29, 50, 20, 7, 5, 47, 38, 9, 35, 16, 38, 26, 53, 30, 41, 1, 41, 56, 26, 37, 45, 16, 48, 30, 20, 36, 28, 4, 19, 34, 56, 51, 18, 48, 11, 17, 35, 1, 32, 38, 38, 20, 37, 59, 55, 29, 36, 5, 25, 1, 11, 34, 32, 33, 22, 47, 36, 59, 33, 24, 38, 11, 58, 15, 48, 5, 2, 16, 59, 49, 50, 23, 53, 21, 45, 48, 25, 49, 9, 20, 23, 10, 58, 17, 40, 25, 11, 12, 45], 'gender' : ['F', 'F', 'M', 'M', 'F', 'F', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'M', 'F', 'F', 'M', 'F', 'F', 'M', 'F', 'M', 'F', 'F', 'F', 'M', 'F', 'M', 'F', 'F', 'F', 'M', 'F', 'M', 'F', 'F', 'M', 'M', 'M', 'F', 'M', 'M', 'M', 'M', 'F', 'M', 'F', 'F', 'F', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'F', 'F', 'M', 'F', 'F', 'M', 'M', 'M', 'F', 'M', 'M', 'F', 'F', 'M', 'F', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'F', 'M', 'M', 'F', 'M', 'M', 'F', 'M', 'M', 'M', 'M', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'F', 'M', 'F', 'F', 'M', 'F', 'F', 'F', 'F', 'M', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'F', 'F', 'F', 'F', 'M', 'M', 'F', 'M', 'M', 'M', 'F', 'F', 'M', 'F', 'F', 'F', 'M', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'F', 'F', 'F', 'F', 'F', 'M', 'M', 'M', 'M', 'F', 'M', 'M', 'M', 'F', 'M', 'M', 'M', 'M', 'M', 'M', 'F', 'M', 'M', 'M', 'M', 'F', 'M', 'F', 'F', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'M', 'M', 'M', 'M', 'F', 'M', 'F', 'F', 'F', 'M', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'M', 'F', 'M', 'F', 'M', 'M', 'F', 'M', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'M', 'F', 'M', 'F', 'M', 'M', 'M', 'F', 'M', 'M', 'M', 'M', 'F', 'M', 'F', 'F', 'F', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'F', 'M', 'M', 'M', 'F', 'F', 'M', 'M', 'M', 'M', 'F', 'M', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'M', 'M', 'M', 'F', 'M', 'F', 'M', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'M', 'M', 'F', 'M', 'M', 'M', 'F', 'F', 'F', 'F', 'F', 'M', 'M', 'F', 'F', 'F', 'F', 'F', 'M', 'M', 'F', 'M', 'F', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'M', 'M', 'F', 'F'] } df = pd.DataFrame(dict)
What a Normal Pivot Table Result Looks Like
We learned in a previous post how to pivot the data . An issue we’ll run into here, is that the resulting pivot table will still contain a lot of data.
pd.pivot_table(data = df, index = 'age', aggfunc = 'count').head()
age | gender |
---|---|
1 | 10 |
2 | 9 |
3 | 4 |
4 | 8 |
5 | 7 |
This is helpful if we want this level of detail. However, what if we wanted to learn about the count of people in each decade of age (10s, 20s, etc.).
Binning Data with Python
We could apply a complicated function via map and merge or we could make use of the .cut() method built into Pandas. This is the recommended process for binning data with Python!
The beautiful thing about the cut method is that we can define the boundaries of our bin edges ourselves. Unlike using the describe method, we’re not constrained to quarters.
Let’s begin by exploring the syntax of the method. The full documentation can be found here.
The general, full syntax looks like this:
pd.cut(x, bins, right: bool = True, labels=None, retbins: bool = False, precision: int = 3, include_lowest: bool = False, duplicates: str = 'raise')
For the purposes of this tutorial, we’ll focus only on a reduced version:
pd.cut(x, bins, labels = None)
Before we begin, it’ll be important for us to understand what the maximum and minimum values are in the age column.
print("Max age: ", df['age'].max()) print("Max age: ", df['age'].min())
Max age: 59 Min age: 1
We now know that we have ages that range from 1-59.
df['bin'] = pd.cut(df['age'], bins = [0,10,20,30,40,50,60]) df.head()
age | gender | bin | |
---|---|---|---|
0 | 7 | F | (0, 10] |
1 | 28 | F | (20, 30] |
2 | 19 | M | (10, 20] |
3 | 19 | M | (10, 20] |
4 | 37 | F | (30, 40] |
In this first example, we dropped the labels argument. What was returned were bin sizes that were numerically defined. What the formatting of each bin tells us is as below:
- ( or ) indicates that a number isn’t included
- [ or ] indicates that a number is included
For example, (0, 10] tells us that the bin includes values from, but not including 0, up to (and including) 10.
This, however, is still quite ugly. What if we want to include the value labels as decades. We can change our function to the following:
df['bin'] = pd.cut(df['age'], bins = [0,10,20,30,40,50,60], labels = ['0s', '10s', '20s', '30s', '40s', '50s']) df.head()
age | gender | bin | |
---|---|---|---|
0 | 7 | F | 0s |
1 | 28 | F | 20s |
2 | 19 | M | 10s |
3 | 19 | M | 10s |
4 | 37 | F | 30s |
Now, when we generate a pivot table, the data will be more manageable and easier to understand:
pd.pivot_table(data = df, index = 'bin', values = 'age', aggfunc = 'count')
age | |
---|---|
bin | |
0s | 60 |
10s | 65 |
20s | 57 |
30s | 74 |
40s | 50 |
50s | 44 |
Conclusion
In this post, we learned how binning data in Python works with Pandas using the cut method. The cut method allows us to easily group data and apply helpful labels to each bin.