In this tutorial, you’ll learn how to use the Pandas value_counts method to count values in your DataFrame and to create frequency tables. Being able to count values in your dataset is an important step in better understanding the distribution of your data. The Pandas .value_counts()
method is packed full of helpful parameters to allow you to fine-tune your analysis.
By the end of this tutorial, you’ll have learned:
- How to use the Pandas
.value_counts()
method - How to create absolute and relative frequencies with the method
- How to handle missing data
- How to simplify binning of a column
- How to combine the
.value_counts()
method with the.groupby()
method to explore frequencies across different groups of data
Table of Contents
Exploring the Pandas value_counts Method
The Pandas value_counts()
method can be applied to both a DataFrame column or to an entire DataFrame. The behavior varies slightly between the two methods. However, for the most part, the methods will work exactly the same. There’ll be a note indicating if there are differences between the two methods.
Let’s take a look at the how the method looks and the default arguments that exist:
# The value_counts() Method Explained
.value_counts(
normalize=False, # Whether to return relative frequencies
sort=True, # Sort by frequencies
ascending=False, # Sort in ascending order
bins=None, # Group them into half-open bins
dropna=True # How to deal with missing values
)
The method has only optional parameters, meaning if you simply want to calculate value counts you can apply the method directly without needing to worry about any arguments being passed in.
Loading a Sample Pandas DataFrame
To follow along with this tutorial, load the dataset below by copying and pasting the provided code. If you have your own data, feel free to use that dataset but your results will, of course, vary.
# Loading a Sample Pandas DataFrame
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/datagy/data/main/students.csv')
print(df.head())
# Returns:
# Grade Class Type Students
# 0 10 Beginner 35
# 1 11 Intermediate 20
# 2 10 Beginner 15
# 3 12 Advanced 15
# 4 9 Intermediate 30
The dataset only has three columns, two of which can be considered categorical. The data covers different grades and class types, as well as indicating how many students are in that grade and level.
Creating a Pandas Frequency Table with value_counts
In this section, you’ll learn how to apply the Pandas .value_counts()
method to a Pandas column. For example, if you wanted to count the number of times each unique value appears in the Students column, you can simply apply the function onto that column.
Let’s see how this looks without passing in any arguments, which will be covered off later in the tutorial.
# Calculating a Frequency Table of a DataFrame Column
print(df['Students'].value_counts())
# Returns:
# 20 32
# 30 23
# 25 16
# 15 12
# 35 10
# 40 7
# Name: Students, dtype: int64
The method returns a Series containing all the different values. Because the return type is a Series, you can easily access the data inside it using indexing. For example, if you wanted to see how often there were 20 students in a class, you could access the [20]
index:
# Accessing the counts of one category
print(df['Students'].value_counts()[20])
# Returns: 32
In the next section, you’ll learn how to sort your Pandas frequency table.
Sorting Your Pandas Frequency Table
Sorting a frequency table generated by the Pandas value_counts method is controlled by two different parameters. First, the sort=
parameter controls whether to sort data. The ascending=
parameter controls how that data is sorted. By default, Pandas will sort the data in descending order.
We can modify this behavior to sort in descending order by modifying the ascending=
parameter to False
. This can be helpful if you want to get a sense of the least frequent values in that dataset.
# Sorting value counts in ascending order
print(df['Students'].value_counts(ascending=True))
# Returns:
# 40 7
# 35 10
# 15 12
# 25 16
# 30 23
# 20 32
# Name: Students, dtype: int64
In this case, our least frequent values appears higher in the returned Series.
By toggling sorting off entirely, data are simply ordered in the order in which Pandas encounters them in your dataset. In the next section, you’ll learn how to sort the returned values by their label.
Sorting Your Pandas Frequency Table by Index
In the previous section, you learned that the .value_counts()
method returns a Series that can be sorted or unsorted. There may be times, however, when you want to sort the Series by the labels of the data (i.e., its category), rather than the frequencies.
This can be accomplished by chaining the .sort_index()
method onto the returned Series. The data can either be sorted in ascending order or descending order, by using the ascending=
parameter. Let’s sort our frequency table by their categories in ascending order.
# Sorting frequency table by its index
print(df['Students'].value_counts().sort_index(ascending=True))
# Returns:
# 15 12
# 20 32
# 25 16
# 30 23
# 35 10
# 40 7
# Name: Students, dtype: int64
In the next section, you’ll learn how to calculate a Pandas value counts table that uses normalized percentages, rather than values.
Calculating a Pandas Frequecy Table with Percentages
Pandas also makes it very easy to display these frequencies as percentages. This can make the data much simpler to understand. In order to transform the counts into percentages, you can use the normalize=
parameter. By default, the argument is set to False
, but toggling it to True
converts the values:
# Showing percentages of value counts
print(df['Students'].value_counts(normalize=True))
# Returns:
# 20 0.32
# 30 0.23
# 25 0.16
# 15 0.12
# 35 0.10
# 40 0.07
# Name: Students, dtype: float64
Remember, by default, Pandas will sort the values in descending order. Normalizing the data allows you to more easily how much of a dataset the category represents.
Binning Data in Pandas Frequency Tables
The Pandas value_counts
method can also be used to bin data into different equal sized groups. This method is a convenience function for the Pandas .cut()
method, and provides the number of values in each group.
Let’s split the data into four numeric groups using the bin=
parameter:
# Binning data with Pandas value_counts
print(df['Students'].value_counts(bins=4))
# Returns:
# (14.974, 21.25] 44
# (27.5, 33.75] 23
# (33.75, 40.0] 17
# (21.25, 27.5] 16
# Name: Students, dtype: int64
Dealing with Missing Values in Pandas Frequency Tables
In this section, you’ll learn how to work with missing data while working with the .value_counts()
method. In order to cover this section off in better detail, let’s insert some missing values into our DataFrame.
# Inserting missing values into the DataFrame
df.loc[:10, 'Students'] = None
print(df.head())
# Returns:
# Grade Class Type Students
# 0 10 Beginner NaN
# 1 11 Intermediate NaN
# 2 10 Beginner NaN
# 3 12 Advanced NaN
# 4 9 Intermediate NaN
Now that we have missing values in our DataFrame, let’s apply the method with its default parameters and see how the results look:
# Seeing value counts
print(df['Students'].value_counts())
# Returns:
# 20.0 27
# 30.0 21
# 25.0 16
# 15.0 10
# 35.0 8
# 40.0 7
# Name: Students, dtype: int64
By default, the method will drop any missing values. It can often be useful to include these values. This can be done by passing in True
into the dropna=
parameter.
# Including Missing Values in the value_counts Method
print(df['Students'].value_counts(dropna=False))
# Returns:
# 20.0 27
# 30.0 21
# 25.0 16
# NaN 11
# 15.0 10
# 35.0 8
# 40.0 7
# Name: Students, dtype: int64
Calculating a Frequency Table on Multiple Columns with value_counts
The .value_counts()
can also be applied the multiple columns. The benefit of applying the method to the entire DataFrame is that you gain access to the subset=
parameter. This allows you to pass in a list of columns, which will return the values in the cross-section of columns. Let’s see what the distribution of values is across the Grade and Class Type columns.
# Applying .value_counts() to multiple columns
print(df.value_counts(subset=['Class Type', 'Grade']))
# Returns:
# Class Type Grade
# Beginner 11 14
# Advanced 10 11
# Intermediate 10 10
# 12 10
# 9 9
# Advanced 9 7
# 11 7
# Beginner 10 7
# Intermediate 11 7
# Advanced 12 6
# Beginner 9 6
# 12 6
# dtype: int64
Right now, this returns a fairly messy Series. You can clean this up by sorting the index to better understand the hierarchy.
# Sorting the index after applying .value_counts to a DataFrame
print(df.value_counts(subset=['Class Type', 'Grade']).sort_index())
# Returns:
# Class Type Grade
# Advanced 9 7
# 10 11
# 11 7
# 12 6
# Beginner 9 6
# 10 7
# 11 14
# 12 6
# Intermediate 9 9
# 10 10
# 11 7
# 12 10
# dtype: int64
Exercises
It’s time to check your learning! Try and complete the exercises below. To solve the exercises, use the DataFrame provided below:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/datagy/data/main/students.csv')
df.loc[:10, 'Students'] = None
To verify your solution, simply toggle the question.
The .idxmax() method returns the index of the maximum value in a Series or DataFrame. Use the method to find the Grade that appears most often.
print(df['Grade'].value_counts().idxmax())
# Returns: 10
What percentage of values in the Students column are missing?
print(df['Students'].isna().value_counts(normalize=True)[1])
# Returns: 0.11
How often does the second-highest number of students appear in the dataset?
print(df['Students'].value_counts().iloc[1])
# Returns: 21
Conclusion and Recap
In this tutorial, you learned how to use the Pandas value_counts method to calculate a frequency table counting the values in a Series or DataFrame. The section below provides a recap of what you learned:
- The
value_counts()
method can be applied to either a Pandas Series or DataFrame - The method counts the number of times a value appears
- The method can convert the values into a normalized percentage, using the
normalize=True
argument - The method can be applied to multiple columns to establish a hierarchy between columns
Hello,
I get a name ‘NaN’ is not defined NameError when I execute the answer from the second question.
Thanks for letting me know! I have updated the code.