Skip to content

How to Sort Data in a Pandas DataFrame

How to Sort Data in a Pandas DataFrame Cover Image

Sorting data is an essential method to better understand your data. In this post, you’ll learn how to sort data in a Pandas DataFrame using the Pandas .sort_values() function, in ascending and descending order, as well as sorting by multiple columns.

Being able to sort your data opens you up to many different opportunities. When working in more visual data tools, like Excel, one of the most common tasks you’re probably familiar with is sorting your data. This allows you to get a sense of the data you’re working with. Being able to do this in Pandas opens you up to a broad type of additional analysis to take on. Let’s get started!

Video Tutorial

Exploring the Pandas sort_values() Method

We can sort values in a Pandas DataFrame by using the .sort_values() method. The method provides an incredible array of parameters that allow you to customize how your data is sorted to your heart’s content! Let’s take a look at the large variety of options available to us:

df.sort_values(
    by = [],
    axis = 0, 
    ascending = True, 
    inplace = False, 
    kind = 'quicksort', 
    na_position = 'last', 
    ignore_index = False, 
    key = None
)

The table below breaks down these parameters one by one which allows you to get a sense of how they can change your sorting of data:

ParameterInput TypeDefault ValueDescription
by=string or list of stringsN/AThe name of the column (or row) or list of names to sort by
axis=0 or 10The axis to be sorted (0 for rows and 1 for columns)
ascending=boolean, or list of booleansTrueWhether to sort ascendingly or descendingly, and can be different for each column passed in
inplace=booleanFalseWhether to perform the operation in place or not
kind=string‘quicksort’The choice of sorting algorithm, from the following options: {‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’}
na_position=string‘last’Whether to put missing values in first or last position
ignore_index=booleanFalseWhether to relabel the index axis or not
key=callableNoneThe callable function to apply to help sort the data
The parameters of the Pandas .sort_values() method

There’s a lot to customize in the .sort_values() method. There is also a lot that you can accomplish with just a few parameters. Now let’s dive into sorting your data.

Loading a Sample Pandas DataFrame

Let’s start by loading a sample Pandas DataFrame. The dataset is hosted on Github and can be loaded using the .read_csv() function. We’ll add in an additional parameter to parse our 'date' column as dates. After loading the DataFrame, let’s print out the first five rows using the .head() method:

# Loading a Sample Pandas DataFrame
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/datagy/data/main/sales.csv', parse_dates=['date'])

print(df.head())

# Returns:
#         date  gender      region  sales
# 0 2022-08-22    Male  North-West  20381
# 1 2022-03-05    Male  North-East  14495
# 2 2022-02-09    Male  North-East  13510
# 3 2022-06-22    Male  North-East  15983
# 4 2022-08-10  Female  North-West  15007

We can see that the DataFrame has four different columns, of three different data types:

  1. 'date' is a column describing the date of a sale
  2. 'gender' and 'region' are string-type columns
  3. 'sales' is an integer column that describes the amount of sales made on a given day

Let’s dive into how to sort our Pandas DataFrame using the .sort_values() method.

Sorting a Single Pandas DataFrame Column

The key parameter in the .sort_values() function is the by= parameter, as it tells Pandas which column(s) to sort by. The parameter takes either a single column as a string or a list of columns as a list of strings. Let’s start by sorting our data by a single column. We can sort the data by the 'sales' column.

# Sort a Pandas DataFrame by a Single Column
sorted = df.sort_values(by='sales')
print(sorted.head())

# Returns:
#           date  gender      region  sales
# 251 2022-10-20    Male  North-West   6084
# 346 2022-05-06  Female  North-West   6403
# 489 2022-07-01  Female  North-West   6824
# 463 2022-09-12  Female  North-West   7432
# 890 2022-10-02    Male  North-West   7633

Let’s break down what we did here:

  1. We created a new DataFrame, sorted
  2. We appled the .sort_values() method, passing in only a single column to sort our data by

By default, Pandas will sort data in ascending order. This means that the smallest numbers will be placed at the top. In later sections, you’ll learn how to modify this behavior to sort data in a different order.

Sorting Multiple Pandas DataFrame Columns

The Pandas .sort_values() method makes it easy to sort by multiple columns. In the previous section, you learned to pass in a single column as a string value. If you pass in a list of strings, you can modify the sort behavior. This allows you to establish a sorting hierarchy, where data are first sorted by the values in one column, and then establish a sort order within that order.

Let’s sort our data first by the ‘region’ column and then by the ‘sales’ column.

# Sort a Pandas DataFrame by Multiple Column
sorted = df.sort_values(by=['region', 'sales'])
print(sorted.head())

# Returns:
#           date  gender      region  sales
# 850 2022-07-26  Female  North-East  11664
# 770 2022-05-27    Male  North-East  12009
# 434 2022-11-28  Female  North-East  12353
# 76  2022-11-14  Female  North-East  12404
# 478 2022-09-12  Female  North-East  12709

This looks quite different from the result we achieved in our first example. What’s happened is that our data are first sorted by the 'region' column, in ascending order. Values belonging to that region are subsequently sorted by the 'sales' column. This means that the ordering is reset when the second outer column is reached.

Change Sort Order in Pandas sort_values

In the examples above, we saw that the sort order defaulted to sort data in ascending order. We can modify this behavior by making use of the ascending= parameter. The parameter accepts a boolean value, meaning either True or False. The default value for this is True. If you wanted to change the data to sort in descending order, simply change the value to False. Let’s give this a try!

# Sorting Values in Descending Order in Pandas
sorted = df.sort_values(by='sales', ascending=False)
print(sorted.head())

# Returns:
#           date  gender region  sales
# 61  2022-02-22  Female  South  43775
# 673 2022-04-19    Male  South  37878
# 111 2022-10-31  Female  South  36444
# 892 2022-09-05    Male  South  35723
# 136 2022-02-27    Male  South  35485

We can see above that the data was sorted by the 'sales' column but in descending order. This means that the data started with the highest value and goes down from there.

Changing Sort Order for Multiple DataFrame Columns

Similar to how you were able to pass in a list of columns to sort by multiple columns, you’re also able to pass in a list of boolean values to modify the sort order of the various columns. This means that we can sort one column in, say, ascending order and another in descending order. Let’s take a look at this below:

# Changing the Sort Order of Multiple Columns
sorted = df.sort_values(
    by=['region', 'sales'],
    ascending=[False, True]
)
print(sorted.head())

# Returns:
#           date  gender region  sales
# 739 2022-02-28    Male  South  10768
# 460 2022-11-15    Male  South  11567
# 616 2022-11-18  Female  South  12940
# 462 2022-09-28    Male  South  13055
# 213 2022-01-25    Male  South  13230

What we did here was pass in a list of boolean values which allowed us to modify the sort order for each column. What we did here was sort first by the 'region' column in descending order (meaning reverse alphabetical). We then sorted the data by the 'sales' column in increasing order.

Sorting with Missing Values in a Pandas DataFrame

Missing values can often cause unexpected results. Thankfully, Pandas gives you extensive control of how these values are sorted. By default, missing values are sorted at the end of the sort values. Let’s modify our DataFrame to include some missing values:

# Adding Missing Values to Our DataFrame
df.loc[10:20, 'sales'] = None

Using the code above allowed us to insert ten missing values into our DataFrame. We can now explore how the missing values are sorted. Let’s sort our DataFrame by the 'sales' column in ascending order and see where our missing data ends up:

# Understanding Sort Order of Missing Data
sorted = df.sort_values(
    by='sales',
)

print(sorted)

# Returns:
#           date  gender      region   sales
# 251 2022-10-20    Male  North-West  6084.0
# 346 2022-05-06  Female  North-West  6403.0
# 489 2022-07-01  Female  North-West  6824.0
# 463 2022-09-12  Female  North-West  7432.0
# 890 2022-10-02    Male  North-West  7633.0
# ..         ...     ...         ...     ...
# 16  2022-04-18    Male  North-West     NaN
# 17  2022-07-30  Female  North-West     NaN
# 18  2022-02-03    Male  North-East     NaN
# 19  2022-01-15    Male  North-East     NaN
# 20  2022-01-21  Female       South     NaN

We can see that our missing data ended up at the bottom of our resulting DataFrame. This is because the default argument is 'na_position='last'. If we wanted the missing data to appear at the top of the sort order, we can pass in 'first' as the argument. Let’s give this a shot:

# Changing Sort Order of Missing Data
sorted = df.sort_values(
    by='sales',
    na_position='first'
)

print(sorted)

# Returns:
#           date  gender      region    sales
# 10  2022-12-21    Male       South      NaN
# 11  2022-04-30  Female  North-West      NaN
# 12  2022-11-25  Female  North-East      NaN
# 13  2022-08-14  Female  North-East      NaN
# 14  2022-02-24  Female  North-East      NaN
# ..         ...     ...         ...      ...
# 136 2022-02-27    Male       South  35485.0
# 892 2022-09-05    Male       South  35723.0
# 111 2022-10-31  Female       South  36444.0
# 673 2022-04-19    Male       South  37878.0
# 61  2022-02-22  Female       South  43775.0

Resetting an Index with a Sorted Pandas DataFrame

One of the things you may have noticed is that in the previous examples, the resulting DataFrame maintained its original index labels. This may not always be the result you’re hoping for. Because of this, we can modify the ignore_index= argument, which defaults to False. When we modify the boolean to True, we can let Pandas know that we want to effectively reset the index.

# Ignoring an Index when Sorting
sorted = df.sort_values(
    by='sales',
    ignore_index=True
)

print(sorted.head())

# Returns:
#         date  gender      region  sales
# 0 2022-10-20    Male  North-West   6084
# 1 2022-05-06  Female  North-West   6403
# 2 2022-07-01  Female  North-West   6824
# 3 2022-09-12  Female  North-West   7432
# 4 2022-10-02    Male  North-West   7633

The resultant DataFrame’s index begins at 0 and increases through to the length of the DataFrame minus 1. This is a much cleaner DataFrame. However, keep in mind that this modifies the index permanently. If the index represents meaningful labeled data, this may not be the result you were intending.

In the final section of this tutorial, you’ll learn how to sort your Pandas DataFrame in place, meaning that it does not need to be re-assigned.

Sorting a Pandas DataFrame In Place

In all of the above examples, you have learned to re-assign the resulting DataFrame. Pandas, however, also provides you with the option to sort the data in place. What this means is that the original DataFrame is modified directly, without needing to create a new object.

This can be done by modifying the inplace= parameter. This parameter defaults to False; modifying it to True will allow the operation to occur in place. Let’s see what this looks like:

# Sorting a DataFrame in place
df.sort_values(
    by='sales',
    inplace=True
)

print(df.head())

# Returns:
#           date  gender      region  sales
# 251 2022-10-20    Male  North-West   6084
# 346 2022-05-06  Female  North-West   6403
# 489 2022-07-01  Female  North-West   6824
# 463 2022-09-12  Female  North-West   7432
# 890 2022-10-02    Male  North-West   7633

The resulting DataFrame above has been modified in place, meaning that we didn’t need to re-assign it.

Exercises

It’s time to check test your learning! There are a number of exercises below, for which you can find the solutions by clicking on the toggle. Try to solve them on your own first and then check your understanding by viewing the solution.

df.sort_values(by=['gender', 'region'], ascending=[False, False])

You could combine .sort_values() with .iloc:

print(
    df.sort_values(
        by='sales',         # Sort by Sales
        ascending=False     # Sort in descending order
    ).iloc[1, 3])           # Access the second item in the fourth column

Conclusion and Recap

In this tutorial, you learned how to sort your Pandas DataFrame using the .sort_values() method. As a quick refresher:

  • The Pandas .sort_values() method allows you to sort a dataframe by one or by multiple columns
  • The default sort method is in ascending order placing missing values at the end
  • You can establish different hierarchies by sorting by multiple columns
  • Ignoring your index allows you to build a tidier DataFrame
  • The operation can also be done in place, allowing you to sort without needing to re-assign your values

To learn more about the .sort_values() method, check out the official documentation here.

1 thought on “How to Sort Data in a Pandas DataFrame”

  1. Pingback: Creating Pivot Tables in Pandas with Python for Python and Pandas • datagy

Leave a Reply

Your email address will not be published.