In this tutorial, you’ll learn how to use the Pandas value_counts method to count values in your DataFrame and to create frequency tables. Being able to count values in your dataset is an important step in better understanding the distribution of your data. The Pandas .value_counts() method is packed full of helpful parameters to allow you to fine-tune your analysis.
By the end of this tutorial, you’ll have learned:
- How to use the Pandas .value_counts() method
- How to create absolute and relative frequencies with the method
- How to handle missing data
- How to simplify binning of a column
- How to combine the .value_counts() method with the .groupby() method to explore frequencies across different groups of data
Exploring the Pandas value_counts Method
The Pandas value_counts() method can be applied to either a DataFrame column or to an entire DataFrame. The behavior varies slightly between the two uses; for the most part, however, the method works exactly the same. There’ll be a note indicating any differences between the two approaches.
Let’s take a look at how the method looks and the default arguments that exist:
# The value_counts() Method Explained
.value_counts(
    normalize=False,  # Whether to return relative frequencies
    sort=True,        # Whether to sort by frequency
    ascending=False,  # Whether to sort in ascending order (the default sorts descending)
    bins=None,        # Group numeric values into half-open bins (a convenience for pd.cut)
    dropna=True       # Whether to exclude missing values from the counts
)
All of the method’s parameters are optional, meaning that if you simply want to calculate value counts, you can apply the method directly without passing in any arguments.
Loading a Sample Pandas DataFrame
To follow along with this tutorial, load the dataset below by copying and pasting the provided code. If you have your own data, feel free to use that dataset but your results will, of course, vary.
# Loading a Sample Pandas DataFrame
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/datagy/data/main/students.csv')
print(df.head())
# Returns:
# Grade Class Type Students
# 0 10 Beginner 35
# 1 11 Intermediate 20
# 2 10 Beginner 15
# 3 12 Advanced 15
# 4 9 Intermediate 30
The dataset only has three columns, two of which can be considered categorical. The data covers different grades and class types, as well as indicating how many students are in that grade and level.
Creating a Pandas Frequency Table with value_counts
In this section, you’ll learn how to apply the Pandas .value_counts() method to a Pandas column. For example, if you wanted to count the number of times each value appears in the Students column, you can simply apply the method to that column.
Let’s see how this looks without passing in any arguments; the various arguments will be covered later in the tutorial.
# Calculating a Frequency Table of a DataFrame Column
print(df['Students'].value_counts())
# Returns:
# 20 32
# 30 23
# 25 16
# 15 12
# 35 10
# 40 7
# Name: Students, dtype: int64
The method returns a Series containing all the different values. Because the return type is a Series, you can easily access the data inside it using indexing. For example, if you wanted to see how often there were 20 students in a class, you could access the [20] index:
# Accessing the counts of one category
print(df['Students'].value_counts()[20])
# Returns: 32
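Note that indexing with a value that doesn’t appear in the Series will raise a KeyError. If you aren’t sure a value is present, the Series .get() method returns a default instead (a small illustrative safeguard, not part of the original example):
# Safely accessing a count that may not exist
print(df['Students'].value_counts().get(99, 0))
# Returns: 0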
In the next section, you’ll learn how to sort your Pandas frequency table.
Sorting Your Pandas Frequency Table
Sorting a frequency table generated by the Pandas value_counts method is controlled by two different parameters. First, the sort= parameter controls whether to sort the data at all. The ascending= parameter controls the direction of that sort. By default, Pandas will sort the data in descending order.
We can modify this behavior to sort in ascending order by setting the ascending= parameter to True. This can be helpful if you want to get a sense of the least frequent values in the dataset.
# Sorting value counts in ascending order
print(df['Students'].value_counts(ascending=True))
# Returns:
# 40 7
# 35 10
# 15 12
# 25 16
# 30 23
# 20 32
# Name: Students, dtype: int64
In this case, the least frequent values appear higher in the returned Series.
If you toggle sorting off entirely by passing sort=False, the data are simply returned in the order in which Pandas encounters them in your dataset.
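As a quick sketch, toggling sorting off looks like the following (the exact order will depend on how the values appear in your data):
# Returning value counts without sorting by frequency
print(df['Students'].value_counts(sort=False))
In the next section, you’ll learn how to sort the returned values by their label.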
Sorting Your Pandas Frequency Table by Index
In the previous section, you learned that the .value_counts() method returns a Series that can be sorted or unsorted. There may be times, however, when you want to sort the Series by the labels of the data (i.e., its categories), rather than by the frequencies.
This can be accomplished by chaining the .sort_index() method onto the returned Series. The data can be sorted in either ascending or descending order using the ascending= parameter. Let’s sort our frequency table by its categories in ascending order.
# Sorting frequency table by its index
print(df['Students'].value_counts().sort_index(ascending=True))
# Returns:
# 15 12
# 20 32
# 25 16
# 30 23
# 35 10
# 40 7
# Name: Students, dtype: int64
In the next section, you’ll learn how to calculate a Pandas value counts table that uses normalized percentages, rather than values.
Calculating a Pandas Frequency Table with Percentages
Pandas also makes it very easy to display these frequencies as percentages. This can make the data much simpler to understand. In order to transform the counts into percentages, you can use the normalize= parameter. By default, the argument is set to False, but toggling it to True converts the counts into relative frequencies:
# Showing percentages of value counts
print(df['Students'].value_counts(normalize=True))
# Returns:
# 20 0.32
# 30 0.23
# 25 0.16
# 15 0.12
# 35 0.10
# 40 0.07
# Name: Students, dtype: float64
Remember, by default, Pandas will sort the values in descending order. Normalizing the data allows you to more easily see how much of the dataset each category represents.
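If you’d rather display these proportions as formatted percentages, one option (an illustrative sketch using standard pandas operations, not part of the original example) is to multiply by 100 and format the result:
# Formatting normalized value counts as percentage strings
pct = df['Students'].value_counts(normalize=True) * 100
print(pct.round(1).astype(str) + '%')
# Returns:
# 20    32.0%
# 30    23.0%
# 25    16.0%
# 15    12.0%
# 35    10.0%
# 40     7.0%
# Name: Students, dtype: object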
Binning Data in Pandas Frequency Tables
The Pandas value_counts method can also be used to bin data into equal-sized groups. This acts as a convenience for the pd.cut() function, returning the number of values in each group.
Let’s split the data into four numeric groups using the bins= parameter:
# Binning data with Pandas value_counts
print(df['Students'].value_counts(bins=4))
# Returns:
# (14.974, 21.25] 44
# (27.5, 33.75] 23
# (33.75, 40.0] 17
# (21.25, 27.5] 16
# Name: Students, dtype: int64
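Under the hood, this is a convenience for cutting the column with pd.cut() and counting the resulting intervals. A minimal sketch that should reproduce the same counts:
# Reproducing the binned counts with pd.cut() directly
print(pd.cut(df['Students'], bins=4).value_counts())
# Returns the same four intervals and counts as above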
Dealing with Missing Values in Pandas Frequency Tables
In this section, you’ll learn how to work with missing data when using the .value_counts() method. To cover this in more detail, let’s insert some missing values into our DataFrame.
# Inserting missing values into the DataFrame
df.loc[:10, 'Students'] = None
print(df.head())
# Returns:
# Grade Class Type Students
# 0 10 Beginner NaN
# 1 11 Intermediate NaN
# 2 10 Beginner NaN
# 3 12 Advanced NaN
# 4 9 Intermediate NaN
Now that we have missing values in our DataFrame, let’s apply the method with its default parameters and see how the results look:
# Seeing value counts
print(df['Students'].value_counts())
# Returns:
# 20.0 27
# 30.0 21
# 25.0 16
# 15.0 10
# 35.0 8
# 40.0 7
# Name: Students, dtype: int64
By default, the method will drop any missing values. It can often be useful to include these values, however. This can be done by passing False to the dropna= parameter.
# Including Missing Values in the value_counts Method
print(df['Students'].value_counts(dropna=False))
# Returns:
# 20.0 27
# 30.0 21
# 25.0 16
# NaN 11
# 15.0 10
# 35.0 8
# 40.0 7
# Name: Students, dtype: int64
Calculating a Frequency Table on Multiple Columns with value_counts
The .value_counts() method can also be applied to multiple columns. The benefit of applying the method to the entire DataFrame is that you gain access to the subset= parameter. This allows you to pass in a list of columns, and the method will count the unique combinations of values across those columns. Let’s see what the distribution of values is across the Grade and Class Type columns.
# Applying .value_counts() to multiple columns
print(df.value_counts(subset=['Class Type', 'Grade']))
# Returns:
# Class Type Grade
# Beginner 11 14
# Advanced 10 11
# Intermediate 10 10
# 12 10
# 9 9
# Advanced 9 7
# 11 7
# Beginner 10 7
# Intermediate 11 7
# Advanced 12 6
# Beginner 9 6
# 12 6
# dtype: int64
Right now, this returns a fairly messy Series. You can clean this up by sorting the index to better understand the hierarchy.
# Sorting the index after applying .value_counts to a DataFrame
print(df.value_counts(subset=['Class Type', 'Grade']).sort_index())
# Returns:
# Class Type Grade
# Advanced 9 7
# 10 11
# 11 7
# 12 6
# Beginner 9 6
# 10 7
# 11 14
# 12 6
# Intermediate 9 9
# 10 10
# 11 7
# 12 10
# dtype: int64
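As mentioned in the introduction, you can produce a similar breakdown by combining .value_counts() with the .groupby() method. A minimal sketch, which should return the same counts (though the grouped ordering may differ):
# Counting Grade frequencies within each Class Type group
print(df.groupby('Class Type')['Grade'].value_counts())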
Exercises
It’s time to check your learning! Try and complete the exercises below. To solve the exercises, use the DataFrame provided below:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/datagy/data/main/students.csv')
df.loc[:10, 'Students'] = None
To verify your solution, check the answer shown directly below each question.
How often does the second most common number of students appear in the dataset?
print(df['Students'].value_counts().iloc[1])
# Returns: 21
What percentage of values in the Students column are missing?
import numpy as np
print(df['Students'].value_counts(normalize=True, dropna=False)[np.nan])
# Returns: 0.11
The .idxmax() method returns the index of the maximum value in a Series or DataFrame. Use the method to find the Grade that appears most often.
print(df['Grade'].value_counts().idxmax())
# Returns: 10
Conclusion and Recap
In this tutorial, you learned how to use the .value_counts() method to calculate a frequency table counting the values in a Series or DataFrame. The section below provides a recap of what you learned:
- The .value_counts() method can be applied to either a Pandas Series or DataFrame
- The method counts the number of times each value appears
- The method can convert the values into normalized percentages, using the normalize=True argument
- The method can be applied to multiple columns to establish a hierarchy between columns