Pandas Rank Function: Rank Dataframe Data (SQL row_number Equivalent)

Pandas Rank Cover Image

The Pandas rank function can be used to rank your data and represents a viable equivalent to the SQL ROW_NUMBER function. In this tutorial, you’ll learn how to use the rank function including how to rank an entire dataframe or just a number of different columns. You’ll learn how to use the different parameters that the Pandas rank function offers. This allows you to change the ranking order and how to deal with equal values in their rankings. You’ll also learn how to rank a Pandas dataframe when combined with grouped data. You’ll walk through a practical example of how to select only the highest valued rows of a grouping.

The Quick Answer: Pandas .rank() Ranks Your Data

Quick Answer - Pandas Rank Dataframe ROW_NUMBER
How the Pandas .rank() method works

Understanding the Pandas Rank Method

The Pandas .rank() method is very similar to the ROW_NUMBER() window function found in SQL. It allows you to, well, rank your data in different ways. On the surface, the function looks relatively simple. However, there are a lot of complexities that sit under the surface and this post will explore them all.

Let’s take a look at the basic syntax of the Pandas rank method to see what arguments are available to us:

import pandas as pd
df.rank(
    axis = 0,
    method = 'average',
    numeric_only = True,
    na_option = 'keep',
    ascending = True,
    pct = False
)

Let’s break these arguments down a little further, in order to see how we can use them to get our desired results when ranking our data:

  • axis=: determines which index to direct the ranking in (either rows or columns)
  • method=: how to rank the groups of records that have the same value
  • numeric_only=: if the entire dataframe is being ranked, whether to include only numeric columns or not
  • na_option=: how to rank NaN values
  • ascending=: whether the elements should be ranked in ascending order or not
  • pct=: whether or not to display the returned rankings in percentile form (i.e., normalizing the rankings to a value of 1)

We can see here that the Pandas .rank() method offes us a large amount of flexibility, while also providing a good amount of argument defaults to ensure general consistency.

One of the very interesting arguments is the method= argument, which provides us with a number of options of how to treat duplicate values (i.e., ‘tied’ values). For example, we may want to rank those values the same or take the averge of them. In a later section, we’ll dive into this in a more detailed way in order to ensure you get the result you want.

Let’s begin by loading a sample Pandas dataframe to help follow along with.

Check out some other Python tutorials on datagy, including our complete guide to styling Pandas and our comprehensive overview of Pivot Tables in Pandas!

Loading a Sample Pandas Dataframe

If you want to follow along with this tutorial line by line, feel free to load the sample Pandas dataframe below. If you have your own dataframe to follow along with, that’s great too!

Let’s see what our dataframe looks like by first loading it and then printing out the first five records, using the dataframe .head() method.

# Loading a Sample Pandas Dataframe for the tutorial
import pandas as pd

df = pd.DataFrame.from_dict({
    'Name': ['Nik', 'Kate', 'Evan', 'Kyra', 'Piet', 'Maya'],
    'Count': [100, 100, 105, 95, 75, 150],
    'Score': [22, 33, 11, 55, 77, 99] 
})

print(df.head())

# Returns:
#    Name  Count  Score
# 0   Nik    100   22.0
# 1  Kate    100   33.0
# 2  Evan    105   11.0
# 3  Kyra     75    NaN
# 4  Piet     75   77.0

We can see that our dataframe has three columns: 1 with string values and 2 with numeric values. Now that we have a dataframe to work with, let’s get started in terms of ranking our data!

Want to learn how to use the Python zip() function to iterate over two lists? This tutorial teaches you exactly what the zip() function does and shows you some creative ways to use the function.

Basic Ranking of Your Pandas Dataframe

The easiest way in which to apply the Pandas .rank() to an entire dataframe with all default arguments.

Let’s see what happens when we do this our dataframe, df:

# Ranking an entire dataframe
ranked = df.rank()

print(ranked)

# Returns:
#    Name  Count  Score
# 0   5.0    3.5    2.0
# 1   2.0    3.5    3.0
# 2   1.0    5.0    1.0
# 3   3.0    1.5    NaN
# 4   6.0    1.5    4.0
# 5   4.0    6.0    5.0

In the example above, we applied the .rank() method to our entire dataframe. Let’s take a look at what happened when we did this:

  • The string column was ranked alphabetically, in ascending order
  • Missing values are ranked as NaN’s, meaning that they’re essentially ignored in the ranking
  • Equivalent items are ranked by the ‘average’ method, meaning that the values of the rank are averaged

Now, let’s see how we can rank only a single column. The Pandas .rank() method is designed in such as way that it returns the same type as the object that calls the method – this means that the method will return a dataframe if a dataframe is passed in, and a series (or a column) when a series is passed in.

Let’s now take a look at how we can pass in only a single column and see how the ranking really plays out. We’ll create a new column, Score_Ranked, that provides rankings for the Score column. Let’s take a look at how we can do this:

# Ranking only a single column in Pandas
df['Score_Ranked'] = df['Score'].rank()

print(df)

# Returns:
#    Name  Count  Score  Score_Ranked
# 0   Nik    100   22.0           2.0
# 1  Kate    100   33.0           3.0
# 2  Evan    105   11.0           1.0
# 3  Kyra     75    NaN           NaN
# 4  Piet     75   77.0           4.0
# 5  Maya    150   99.0           5.0

We can see here that a new column is created that provides the default settings for rankings of the Score column.

In the following sections, you’ll learn how to change the arguments of the .rank() method in order to modify its behaviour.

Want to learn more about Python for-loops? Check out my in-depth tutorial that takes your from beginner to advanced for-loops user! Want to watch a video instead? Check out my YouTube tutorial here.

Pandas Rank Dataframe with Reverse Sort Order

By default, the Pandas .rank() method will rank data in ascending order, meaning that items with lower values will be ranked lower (i.e., starting at 1). If you want to change this behaviour and have the values rank in a descending order, we can set the ascending=False parameter.

Let’s see what this looks like when we rank the the same column in different orders:

# Reversing ranking order of a Pandas Dataframe
df['Score_Ranked_Asc'] = df['Score'].rank()
df['Score_Ranked_Desc'] = df['Score'].rank(ascending=False)

print(df)

# Returns:
#    Name  Count  Score  Score_Ranked_Asc  Score_Ranked_Desc
# 0   Nik    100   22.0               2.0                4.0
# 1  Kate    100   33.0               3.0                3.0
# 2  Evan    105   11.0               1.0                5.0
# 3  Kyra     75    NaN               NaN                NaN
# 4  Piet     75   77.0               4.0                2.0
# 5  Maya    150   99.0               5.0                1.0

We can see here that the ranked values begin on different ends, while the missing NaN values are still treated the same.

In the next section, you’ll learn how rank equal items in different methods by using the method= argument.

Want to learn more about Python f-strings? Check out my in-depth tutorial, which includes a step-by-step video to master Python f-strings!

Pandas Rank Dataframe with Different Methods

It’s not uncommon to have data with equal values. Normally, this doesn’t raise any issues, but when your aim to use Pandas to rank your data, equivalent values need to be told how to be sorted. This is where the method= argument comes in.

The Pandas rank method’s method= argument takes a number of different options. Let’s take a quick look at them:

  • ‘average’: the average rank of the group (e.g., if two values exist at rank 7, they’ll each be assigned the value of 7.5)
  • ‘min’: returns the lowest rank in the group and assigns it to each value
  • ‘max’: returns the highest rank in the group and assigns it to each value
  • ‘first’: ranks are assigned in the order in which they appear in the dataframe
  • ‘dense’: similar to the ‘min’ method, but the rank always increases by 1

At the surface, these values don’t seem like the most intuitive. The easiest way to understand them is to create the rankings for each method.

Let’s create a Pandas dataframe with each ranking method to better explore the method= argument:

# Exploring the different methods by which to deal with duplicates in rankings
df['Count_average'] = df['Count'].rank(method='average')
df['Count_min'] = df['Count'].rank(method='min')
df['Count_max'] = df['Count'].rank(method='max')
df['Count_first'] = df['Count'].rank(method='first')
df['Count_dense'] = df['Count'].rank(method='dense')

print(df)

# Returns:
#    Name  Count  Score  Count_average  Count_min  Count_max  Count_first  Count_dense
# 0   Nik    100   22.0            3.5        3.0        4.0          3.0          2.0
# 1  Kate    100   33.0            3.5        3.0        4.0          4.0          2.0
# 2  Evan    105   11.0            5.0        5.0        5.0          5.0          3.0
# 3  Kyra     75    NaN            1.5        1.0        2.0          1.0          1.0
# 4  Piet     75   77.0            1.5        1.0        2.0          2.0          1.0
# 5  Maya    150   99.0            6.0        6.0        6.0          6.0          4.0

In the sample dataframe above, we can see the nuances of the different methods available to rank your data by.

In the next section, you’ll learn how to rank data in a group.

Want to learn more about calculating the square root in Python? Check out my tutorial here, which will teach you different ways of calculating the square root, both without Python functions and with the help of functions.

Pandas Rank Dataframe with a Groupby (Grouped Rankings)

A great application of the Pandas .rank() method is to be able to apply it to a group. This has many practical applications such as being able to select the lowest or highest value on a particular day. To learn more about the Pandas .groupby() method, check out my in-depth tutorial, on YouTube:

For this example, let’s load a different dataframe.

# Loading another sample dataframe
import pandas as pd
df = pd.DataFrame.from_dict({
    'Date': ['2021-12-01', '2021-12-01', '2021-12-01', '2021-12-02', '2021-12-02'],
    'Salesperson': ['Nik', 'Kate', 'Evan', 'Nik', 'Kate'],
    'Sales': [100, 105, 110, 95, 130]
})

print(df)

# Returns
#          Date Salesperson  Sales
# 0  2021-12-01         Nik    100
# 1  2021-12-01        Kate    105
# 2  2021-12-01        Evan    110
# 3  2021-12-02         Nik     95
# 4  2021-12-02        Kate    130

Let’s see how we can group our data by the date and then rank by the Sales column:

# Ranking a column based on the grouping of another column in Pandas
import pandas as pd
df = pd.DataFrame.from_dict({
    'Date': ['2021-12-01', '2021-12-01', '2021-12-01', '2021-12-02', '2021-12-02'],
    'Salesperson': ['Nik', 'Kate', 'Evan', 'Nik', 'Kate'],
    'Sales': [100, 105, 110, 95, 130]
})

df['Sales Ranked by Date'] = df.groupby('Date')['Sales'].rank(ascending=False)

print(df)

# Returns
#          Date Salesperson  Sales  Sales Ranked by Date
# 0  2021-12-01         Nik    100                   3.0
# 1  2021-12-01        Kate    105                   2.0
# 2  2021-12-01        Evan    110                   1.0
# 3  2021-12-02         Nik     95                   2.0
# 4  2021-12-02        Kate    130                   1.0

This is a bit more of a complex example – let’s break down what we’ve done here:

  1. We assigned a new column to rank our sales by date
  2. This column is based on grouping our data first by Date and then selecting only the Sales column
  3. We then rank that resulting grouped column in descending order

From this, we can easily select the top Salesperson by date, by filtering our Pandas dataframe. To learn more about selecting data in Pandas, check out my tutorial here.

Let’s see how we can do this:

# Selecting only the top values of a grouping
df['Sales Ranked by Date'] = df.groupby('Date')['Sales'].rank(ascending=False)
df = df[df['Sales Ranked by Date'] == 1.0]

print(df)

# Returns
#          Date Salesperson  Sales  Sales Ranked by Date
# 2  2021-12-01        Evan    110                   1.0
# 4  2021-12-02        Kate    130                   1.0

When we filter our Pandas dataframe, we can easily see the top Salesperson by date!

In the next section, you’ll learn how to rank your Pandas dataframe with percentages, meaning a normalize ranking.

Want to learn how to get a file’s extension in Python? This tutorial will teach you how to use the os and pathlib libraries to do just that!

Pandas Rank Dataframe with Percentages (Normalized Rankings)

Another great attribute of the Pandas .rank() method is that we can normalize our rankings to be a value between 0 and 1. While this may seem trivial, it does allow us to compare the minimum and maximum rankings across different columns, even when they have different numbers of unique values.

We can apply this normalized version of ranking by using the pct= argument. Let’s see how we can apply this in Python and Pandas:

# Ranking a dataframe using normalized rankings
df = df.rank(pct=True)

print(df)

# Returns:
#        Name     Count  Score
# 0  0.833333  0.583333    0.4
# 1  0.333333  0.583333    0.6
# 2  0.166667  0.833333    0.2
# 3  0.500000  0.250000    NaN
# 4  1.000000  0.250000    0.8
# 5  0.666667  1.000000    1.0

We can see here that all the columns rankings do not exceed 1. When we change our other parameters, as shown above, then we can change the stepping over similar values.

In the final section below, you’ll learn how to rank only numeric columns in a Pandas dataframe.

Need to check if a key exists in a Python dictionary? Check out this tutorial, which teaches you five different ways of seeing if a key exists in a Python dictionary, including how to return a default value.

Pandas Rank Only Numeric Columns in a Dataframe

When ranking an entire dataframe, you may wish to omit the ranking of non-numeric columns. This is because ranking based on alphabetical sorting doesn’t really carry the same weight as ranking numeric columns.

Let’s see how we can use the numeric_only= argument to rank only numeric columns:

# Ranking only numeric columns with Pandas .rank()
df = df.rank(numeric_only=True)

print(df)

# Returns:
#    Count  Score
# 0    3.5    2.0
# 1    3.5    3.0
# 2    5.0    1.0
# 3    1.5    NaN
# 4    1.5    4.0
# 5    6.0    5.0

We can see that when we ask Pandas to only rank numeric columns, that it does not return any of non-numeric columns.

Need to automate renaming files? Check out this in-depth guide on using pathlib to rename files. More of a visual learner, the entire tutorial is also available as a video in the post!

Conclusion

In this tutorial, you learned how to use the Pank .rank() method to rank your Pandas data. You learned how to rank just a single column or your entire dataframe. You also learned how to change the sort order of your rankings and how to rank with different methods, including a normalized ranking (proportionally out of 1). Finally, you learned how to rank your Pandas dataframe when data are grouped using a .groupby() method, as well as how to limit your rankings only to numeric columns.

To learn more about the Pandas .rank() method, check out the official documentation here.