The Pandas rank function can be used to rank your data and represents a viable equivalent to the SQL ROW_NUMBER function. In this tutorial, you’ll learn how to use the rank function including how to rank an entire dataframe or just a number of different columns. You’ll learn how to use the different parameters that the Pandas rank function offers. This allows you to change the ranking order and how to deal with equal values in their rankings. You’ll also learn how to rank a Pandas dataframe when combined with grouped data. You’ll walk through a practical example of how to select only the highest valued rows of a grouping.
The Quick Answer: Pandas .rank()
Ranks Your Data
Table of Contents
Understanding the Pandas Rank Method
The Pandas .rank()
method is very similar to the ROW_NUMBER() window function found in SQL. It allows you to, well, rank your data in different ways. On the surface, the function looks relatively simple. However, there are a lot of complexities that sit under the surface and this post will explore them all.
Let’s take a look at the basic syntax of the Pandas rank method to see what arguments are available to us:
import pandas as pd
df.rank(
axis = 0,
method = 'average',
numeric_only = True,
na_option = 'keep',
ascending = True,
pct = False
)
Let’s break these arguments down a little further, in order to see how we can use them to get our desired results when ranking our data:
axis=
: determines which index to direct the ranking in (either rows or columns)method=
: how to rank the groups of records that have the same valuenumeric_only=
: if the entire dataframe is being ranked, whether to include only numeric columns or notna_option=
: how to rankNaN
valuesascending=
: whether the elements should be ranked in ascending order or notpct=
: whether or not to display the returned rankings in percentile form (i.e., normalizing the rankings to a value of 1)
We can see here that the Pandas .rank()
method offes us a large amount of flexibility, while also providing a good amount of argument defaults to ensure general consistency.
One of the very interesting arguments is the method=
argument, which provides us with a number of options of how to treat duplicate values (i.e., ‘tied’ values). For example, we may want to rank those values the same or take the averge of them. In a later section, we’ll dive into this in a more detailed way in order to ensure you get the result you want.
Let’s begin by loading a sample Pandas dataframe to help follow along with.
Check out some other Python tutorials on datagy, including our complete guide to styling Pandas and our comprehensive overview of Pivot Tables in Pandas!
Loading a Sample Pandas Dataframe
If you want to follow along with this tutorial line by line, feel free to load the sample Pandas dataframe below. If you have your own dataframe to follow along with, that’s great too!
Let’s see what our dataframe looks like by first loading it and then printing out the first five records, using the dataframe .head()
method.
# Loading a Sample Pandas Dataframe for the tutorial
import pandas as pd
df = pd.DataFrame.from_dict({
'Name': ['Nik', 'Kate', 'Evan', 'Kyra', 'Piet', 'Maya'],
'Count': [100, 100, 105, 95, 75, 150],
'Score': [22, 33, 11, 55, 77, 99]
})
print(df.head())
# Returns:
# Name Count Score
# 0 Nik 100 22.0
# 1 Kate 100 33.0
# 2 Evan 105 11.0
# 3 Kyra 75 NaN
# 4 Piet 75 77.0
We can see that our dataframe has three columns: 1 with string values and 2 with numeric values. Now that we have a dataframe to work with, let’s get started in terms of ranking our data!
Want to learn how to use the Python zip()
function to iterate over two lists? This tutorial teaches you exactly what the zip()
function does and shows you some creative ways to use the function.
Basic Ranking of Your Pandas Dataframe
The easiest way in which to apply the Pandas .rank()
to an entire dataframe with all default arguments.
Let’s see what happens when we do this our dataframe, df:
# Ranking an entire dataframe
ranked = df.rank()
print(ranked)
# Returns:
# Name Count Score
# 0 5.0 3.5 2.0
# 1 2.0 3.5 3.0
# 2 1.0 5.0 1.0
# 3 3.0 1.5 NaN
# 4 6.0 1.5 4.0
# 5 4.0 6.0 5.0
In the example above, we applied the .rank()
method to our entire dataframe. Let’s take a look at what happened when we did this:
- The string column was ranked alphabetically, in ascending order
- Missing values are ranked as NaN’s, meaning that they’re essentially ignored in the ranking
- Equivalent items are ranked by the ‘average’ method, meaning that the values of the rank are averaged
Now, let’s see how we can rank only a single column. The Pandas .rank()
method is designed in such as way that it returns the same type as the object that calls the method – this means that the method will return a dataframe if a dataframe is passed in, and a series (or a column) when a series is passed in.
Let’s now take a look at how we can pass in only a single column and see how the ranking really plays out. We’ll create a new column, Score_Ranked
, that provides rankings for the Score
column. Let’s take a look at how we can do this:
# Ranking only a single column in Pandas
df['Score_Ranked'] = df['Score'].rank()
print(df)
# Returns:
# Name Count Score Score_Ranked
# 0 Nik 100 22.0 2.0
# 1 Kate 100 33.0 3.0
# 2 Evan 105 11.0 1.0
# 3 Kyra 75 NaN NaN
# 4 Piet 75 77.0 4.0
# 5 Maya 150 99.0 5.0
We can see here that a new column is created that provides the default settings for rankings of the Score
column.
In the following sections, you’ll learn how to change the arguments of the .rank()
method in order to modify its behaviour.
Want to learn more about Python for-loops? Check out my in-depth tutorial that takes your from beginner to advanced for-loops user! Want to watch a video instead? Check out my YouTube tutorial here.
Pandas Rank Dataframe with Reverse Sort Order
By default, the Pandas .rank()
method will rank data in ascending order, meaning that items with lower values will be ranked lower (i.e., starting at 1). If you want to change this behaviour and have the values rank in a descending order, we can set the ascending=False
parameter.
Let’s see what this looks like when we rank the the same column in different orders:
# Reversing ranking order of a Pandas Dataframe
df['Score_Ranked_Asc'] = df['Score'].rank()
df['Score_Ranked_Desc'] = df['Score'].rank(ascending=False)
print(df)
# Returns:
# Name Count Score Score_Ranked_Asc Score_Ranked_Desc
# 0 Nik 100 22.0 2.0 4.0
# 1 Kate 100 33.0 3.0 3.0
# 2 Evan 105 11.0 1.0 5.0
# 3 Kyra 75 NaN NaN NaN
# 4 Piet 75 77.0 4.0 2.0
# 5 Maya 150 99.0 5.0 1.0
We can see here that the ranked values begin on different ends, while the missing NaN values are still treated the same.
In the next section, you’ll learn how rank equal items in different methods by using the method=
argument.
Want to learn more about Python f-strings? Check out my in-depth tutorial, which includes a step-by-step video to master Python f-strings!
Pandas Rank Dataframe with Different Methods
It’s not uncommon to have data with equal values. Normally, this doesn’t raise any issues, but when your aim to use Pandas to rank your data, equivalent values need to be told how to be sorted. This is where the method=
argument comes in.
The Pandas rank method’s method=
argument takes a number of different options. Let’s take a quick look at them:
- ‘average’: the average rank of the group (e.g., if two values exist at rank 7, they’ll each be assigned the value of 7.5)
- ‘min’: returns the lowest rank in the group and assigns it to each value
- ‘max’: returns the highest rank in the group and assigns it to each value
- ‘first’: ranks are assigned in the order in which they appear in the dataframe
- ‘dense’: similar to the ‘min’ method, but the rank always increases by 1
At the surface, these values don’t seem like the most intuitive. The easiest way to understand them is to create the rankings for each method.
Let’s create a Pandas dataframe with each ranking method to better explore the method=
argument:
# Exploring the different methods by which to deal with duplicates in rankings
df['Count_average'] = df['Count'].rank(method='average')
df['Count_min'] = df['Count'].rank(method='min')
df['Count_max'] = df['Count'].rank(method='max')
df['Count_first'] = df['Count'].rank(method='first')
df['Count_dense'] = df['Count'].rank(method='dense')
print(df)
# Returns:
# Name Count Score Count_average Count_min Count_max Count_first Count_dense
# 0 Nik 100 22.0 3.5 3.0 4.0 3.0 2.0
# 1 Kate 100 33.0 3.5 3.0 4.0 4.0 2.0
# 2 Evan 105 11.0 5.0 5.0 5.0 5.0 3.0
# 3 Kyra 75 NaN 1.5 1.0 2.0 1.0 1.0
# 4 Piet 75 77.0 1.5 1.0 2.0 2.0 1.0
# 5 Maya 150 99.0 6.0 6.0 6.0 6.0 4.0
In the sample dataframe above, we can see the nuances of the different methods available to rank your data by.
In the next section, you’ll learn how to rank data in a group.
Want to learn more about calculating the square root in Python? Check out my tutorial here, which will teach you different ways of calculating the square root, both without Python functions and with the help of functions.
Pandas Rank Dataframe with a Groupby (Grouped Rankings)
A great application of the Pandas .rank()
method is to be able to apply it to a group. This has many practical applications such as being able to select the lowest or highest value on a particular day. To learn more about the Pandas .groupby()
method, check out my in-depth tutorial, on YouTube:
For this example, let’s load a different dataframe.
# Loading another sample dataframe
import pandas as pd
df = pd.DataFrame.from_dict({
'Date': ['2021-12-01', '2021-12-01', '2021-12-01', '2021-12-02', '2021-12-02'],
'Salesperson': ['Nik', 'Kate', 'Evan', 'Nik', 'Kate'],
'Sales': [100, 105, 110, 95, 130]
})
print(df)
# Returns
# Date Salesperson Sales
# 0 2021-12-01 Nik 100
# 1 2021-12-01 Kate 105
# 2 2021-12-01 Evan 110
# 3 2021-12-02 Nik 95
# 4 2021-12-02 Kate 130
Let’s see how we can group our data by the date and then rank by the Sales column:
# Ranking a column based on the grouping of another column in Pandas
import pandas as pd
df = pd.DataFrame.from_dict({
'Date': ['2021-12-01', '2021-12-01', '2021-12-01', '2021-12-02', '2021-12-02'],
'Salesperson': ['Nik', 'Kate', 'Evan', 'Nik', 'Kate'],
'Sales': [100, 105, 110, 95, 130]
})
df['Sales Ranked by Date'] = df.groupby('Date')['Sales'].rank(ascending=False)
print(df)
# Returns
# Date Salesperson Sales Sales Ranked by Date
# 0 2021-12-01 Nik 100 3.0
# 1 2021-12-01 Kate 105 2.0
# 2 2021-12-01 Evan 110 1.0
# 3 2021-12-02 Nik 95 2.0
# 4 2021-12-02 Kate 130 1.0
This is a bit more of a complex example – let’s break down what we’ve done here:
- We assigned a new column to rank our sales by date
- This column is based on grouping our data first by Date and then selecting only the Sales column
- We then rank that resulting grouped column in descending order
From this, we can easily select the top Salesperson by date, by filtering our Pandas dataframe. To learn more about selecting data in Pandas, check out my tutorial here.
Let’s see how we can do this:
# Selecting only the top values of a grouping
df['Sales Ranked by Date'] = df.groupby('Date')['Sales'].rank(ascending=False)
df = df[df['Sales Ranked by Date'] == 1.0]
print(df)
# Returns
# Date Salesperson Sales Sales Ranked by Date
# 2 2021-12-01 Evan 110 1.0
# 4 2021-12-02 Kate 130 1.0
When we filter our Pandas dataframe, we can easily see the top Salesperson by date!
In the next section, you’ll learn how to rank your Pandas dataframe with percentages, meaning a normalize ranking.
Want to learn how to get a file’s extension in Python? This tutorial will teach you how to use the os and pathlib libraries to do just that!
Pandas Rank Dataframe with Percentages (Normalized Rankings)
Another great attribute of the Pandas .rank()
method is that we can normalize our rankings to be a value between 0 and 1. While this may seem trivial, it does allow us to compare the minimum and maximum rankings across different columns, even when they have different numbers of unique values.
We can apply this normalized version of ranking by using the pct=
argument. Let’s see how we can apply this in Python and Pandas:
# Ranking a dataframe using normalized rankings
df = df.rank(pct=True)
print(df)
# Returns:
# Name Count Score
# 0 0.833333 0.583333 0.4
# 1 0.333333 0.583333 0.6
# 2 0.166667 0.833333 0.2
# 3 0.500000 0.250000 NaN
# 4 1.000000 0.250000 0.8
# 5 0.666667 1.000000 1.0
We can see here that all the columns rankings do not exceed 1. When we change our other parameters, as shown above, then we can change the stepping over similar values.
In the final section below, you’ll learn how to rank only numeric columns in a Pandas dataframe.
Need to check if a key exists in a Python dictionary? Check out this tutorial, which teaches you five different ways of seeing if a key exists in a Python dictionary, including how to return a default value.
Pandas Rank Only Numeric Columns in a Dataframe
When ranking an entire dataframe, you may wish to omit the ranking of non-numeric columns. This is because ranking based on alphabetical sorting doesn’t really carry the same weight as ranking numeric columns.
Let’s see how we can use the numeric_only=
argument to rank only numeric columns:
# Ranking only numeric columns with Pandas .rank()
df = df.rank(numeric_only=True)
print(df)
# Returns:
# Count Score
# 0 3.5 2.0
# 1 3.5 3.0
# 2 5.0 1.0
# 3 1.5 NaN
# 4 1.5 4.0
# 5 6.0 5.0
We can see that when we ask Pandas to only rank numeric columns, that it does not return any of non-numeric columns.
Need to automate renaming files? Check out this in-depth guide on using pathlib to rename files. More of a visual learner, the entire tutorial is also available as a video in the post!
Conclusion
In this tutorial, you learned how to use the Pank .rank()
method to rank your Pandas data. You learned how to rank just a single column or your entire dataframe. You also learned how to change the sort order of your rankings and how to rank with different methods, including a normalized ranking (proportionally out of 1). Finally, you learned how to rank your Pandas dataframe when data are grouped using a .groupby()
method, as well as how to limit your rankings only to numeric columns.
To learn more about the Pandas .rank()
method, check out the official documentation here.
Pingback: Pandas GroupBy: Group, Summarize, and Aggregate Data in Python
Thank you very much.
It was helpful and solved my problem 🙂
I’m so glad!!