Sorting data is an essential method to better understand your data. In this post, you’ll learn how to sort data in a Pandas DataFrame using the Pandas .sort_values()
function, in ascending and descending order, as well as sorting by multiple columns.
Being able to sort your data opens you up to many different opportunities. When working in more visual data tools, like Excel, one of the most common tasks you’re probably familiar with is sorting your data. This allows you to get a sense of the data you’re working with. Being able to do this in Pandas opens you up to a broad type of additional analysis to take on. Let’s get started!
Table of Contents
Video Tutorial
Exploring the Pandas sort_values() Method
We can sort values in a Pandas DataFrame by using the .sort_values()
method. The method provides an incredible array of parameters that allow you to customize how your data is sorted to your heart’s content! Let’s take a look at the large variety of options available to us:
df.sort_values(
by = [],
axis = 0,
ascending = True,
inplace = False,
kind = 'quicksort',
na_position = 'last',
ignore_index = False,
key = None
)
The table below breaks down these parameters one by one which allows you to get a sense of how they can change your sorting of data:
Parameter | Input Type | Default Value | Description |
---|---|---|---|
by= | string or list of strings | N/A | The name of the column (or row) or list of names to sort by |
axis= | 0 or 1 | 0 | The axis to be sorted (0 for rows and 1 for columns) |
ascending= | boolean, or list of booleans | True | Whether to sort ascendingly or descendingly, and can be different for each column passed in |
inplace= | boolean | False | Whether to perform the operation in place or not |
kind= | string | ‘quicksort’ | The choice of sorting algorithm, from the following options: {‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’} |
na_position= | string | ‘last’ | Whether to put missing values in first or last position |
ignore_index= | boolean | False | Whether to relabel the index axis or not |
key= | callable | None | The callable function to apply to help sort the data |
.sort_values()
methodThere’s a lot to customize in the .sort_values()
method. There is also a lot that you can accomplish with just a few parameters. Now let’s dive into sorting your data.
Loading a Sample Pandas DataFrame
Let’s start by loading a sample Pandas DataFrame. The dataset is hosted on Github and can be loaded using the .read_csv()
function. We’ll add in an additional parameter to parse our 'date'
column as dates. After loading the DataFrame, let’s print out the first five rows using the .head()
method:
# Loading a Sample Pandas DataFrame
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/datagy/data/main/sales.csv', parse_dates=['date'])
print(df.head())
# Returns:
# date gender region sales
# 0 2022-08-22 Male North-West 20381
# 1 2022-03-05 Male North-East 14495
# 2 2022-02-09 Male North-East 13510
# 3 2022-06-22 Male North-East 15983
# 4 2022-08-10 Female North-West 15007
We can see that the DataFrame has four different columns, of three different data types:
'date'
is a column describing the date of a sale'gender'
and'region'
are string-type columns'sales'
is an integer column that describes the amount of sales made on a given day
Let’s dive into how to sort our Pandas DataFrame using the .sort_values()
method.
Sorting a Single Pandas DataFrame Column
The key parameter in the .sort_values()
function is the by=
parameter, as it tells Pandas which column(s) to sort by. The parameter takes either a single column as a string or a list of columns as a list of strings. Let’s start by sorting our data by a single column. We can sort the data by the 'sales'
column.
# Sort a Pandas DataFrame by a Single Column
sorted = df.sort_values(by='sales')
print(sorted.head())
# Returns:
# date gender region sales
# 251 2022-10-20 Male North-West 6084
# 346 2022-05-06 Female North-West 6403
# 489 2022-07-01 Female North-West 6824
# 463 2022-09-12 Female North-West 7432
# 890 2022-10-02 Male North-West 7633
Let’s break down what we did here:
- We created a new DataFrame,
sorted
- We appled the
.sort_values()
method, passing in only a single column to sort our data by
By default, Pandas will sort data in ascending order. This means that the smallest numbers will be placed at the top. In later sections, you’ll learn how to modify this behavior to sort data in a different order.
Sorting Multiple Pandas DataFrame Columns
The Pandas .sort_values() method makes it easy to sort by multiple columns. In the previous section, you learned to pass in a single column as a string value. If you pass in a list of strings, you can modify the sort behavior. This allows you to establish a sorting hierarchy, where data are first sorted by the values in one column, and then establish a sort order within that order.
Let’s sort our data first by the ‘region’ column and then by the ‘sales’ column.
# Sort a Pandas DataFrame by Multiple Column
sorted = df.sort_values(by=['region', 'sales'])
print(sorted.head())
# Returns:
# date gender region sales
# 850 2022-07-26 Female North-East 11664
# 770 2022-05-27 Male North-East 12009
# 434 2022-11-28 Female North-East 12353
# 76 2022-11-14 Female North-East 12404
# 478 2022-09-12 Female North-East 12709
This looks quite different from the result we achieved in our first example. What’s happened is that our data are first sorted by the 'region'
column, in ascending order. Values belonging to that region are subsequently sorted by the 'sales'
column. This means that the ordering is reset when the second outer column is reached.
Change Sort Order in Pandas sort_values
In the examples above, we saw that the sort order defaulted to sort data in ascending order. We can modify this behavior by making use of the ascending=
parameter. The parameter accepts a boolean value, meaning either True
or False
. The default value for this is True
. If you wanted to change the data to sort in descending order, simply change the value to False
. Let’s give this a try!
# Sorting Values in Descending Order in Pandas
sorted = df.sort_values(by='sales', ascending=False)
print(sorted.head())
# Returns:
# date gender region sales
# 61 2022-02-22 Female South 43775
# 673 2022-04-19 Male South 37878
# 111 2022-10-31 Female South 36444
# 892 2022-09-05 Male South 35723
# 136 2022-02-27 Male South 35485
We can see above that the data was sorted by the 'sales'
column but in descending order. This means that the data started with the highest value and goes down from there.
Changing Sort Order for Multiple DataFrame Columns
Similar to how you were able to pass in a list of columns to sort by multiple columns, you’re also able to pass in a list of boolean values to modify the sort order of the various columns. This means that we can sort one column in, say, ascending order and another in descending order. Let’s take a look at this below:
# Changing the Sort Order of Multiple Columns
sorted = df.sort_values(
by=['region', 'sales'],
ascending=[False, True]
)
print(sorted.head())
# Returns:
# date gender region sales
# 739 2022-02-28 Male South 10768
# 460 2022-11-15 Male South 11567
# 616 2022-11-18 Female South 12940
# 462 2022-09-28 Male South 13055
# 213 2022-01-25 Male South 13230
What we did here was pass in a list of boolean values which allowed us to modify the sort order for each column. What we did here was sort first by the 'region'
column in descending order (meaning reverse alphabetical). We then sorted the data by the 'sales'
column in increasing order.
Sorting with Missing Values in a Pandas DataFrame
Missing values can often cause unexpected results. Thankfully, Pandas gives you extensive control of how these values are sorted. By default, missing values are sorted at the end of the sort values. Let’s modify our DataFrame to include some missing values:
# Adding Missing Values to Our DataFrame
df.loc[10:20, 'sales'] = None
Using the code above allowed us to insert ten missing values into our DataFrame. We can now explore how the missing values are sorted. Let’s sort our DataFrame by the 'sales'
column in ascending order and see where our missing data ends up:
# Understanding Sort Order of Missing Data
sorted = df.sort_values(
by='sales',
)
print(sorted)
# Returns:
# date gender region sales
# 251 2022-10-20 Male North-West 6084.0
# 346 2022-05-06 Female North-West 6403.0
# 489 2022-07-01 Female North-West 6824.0
# 463 2022-09-12 Female North-West 7432.0
# 890 2022-10-02 Male North-West 7633.0
# .. ... ... ... ...
# 16 2022-04-18 Male North-West NaN
# 17 2022-07-30 Female North-West NaN
# 18 2022-02-03 Male North-East NaN
# 19 2022-01-15 Male North-East NaN
# 20 2022-01-21 Female South NaN
We can see that our missing data ended up at the bottom of our resulting DataFrame. This is because the default argument is 'na_position='last'
. If we wanted the missing data to appear at the top of the sort order, we can pass in 'first'
as the argument. Let’s give this a shot:
# Changing Sort Order of Missing Data
sorted = df.sort_values(
by='sales',
na_position='first'
)
print(sorted)
# Returns:
# date gender region sales
# 10 2022-12-21 Male South NaN
# 11 2022-04-30 Female North-West NaN
# 12 2022-11-25 Female North-East NaN
# 13 2022-08-14 Female North-East NaN
# 14 2022-02-24 Female North-East NaN
# .. ... ... ... ...
# 136 2022-02-27 Male South 35485.0
# 892 2022-09-05 Male South 35723.0
# 111 2022-10-31 Female South 36444.0
# 673 2022-04-19 Male South 37878.0
# 61 2022-02-22 Female South 43775.0
Resetting an Index with a Sorted Pandas DataFrame
One of the things you may have noticed is that in the previous examples, the resulting DataFrame maintained its original index labels. This may not always be the result you’re hoping for. Because of this, we can modify the ignore_index=
argument, which defaults to False
. When we modify the boolean to True, we can let Pandas know that we want to effectively reset the index.
# Ignoring an Index when Sorting
sorted = df.sort_values(
by='sales',
ignore_index=True
)
print(sorted.head())
# Returns:
# date gender region sales
# 0 2022-10-20 Male North-West 6084
# 1 2022-05-06 Female North-West 6403
# 2 2022-07-01 Female North-West 6824
# 3 2022-09-12 Female North-West 7432
# 4 2022-10-02 Male North-West 7633
The resultant DataFrame’s index begins at 0 and increases through to the length of the DataFrame minus 1. This is a much cleaner DataFrame. However, keep in mind that this modifies the index permanently. If the index represents meaningful labeled data, this may not be the result you were intending.
In the final section of this tutorial, you’ll learn how to sort your Pandas DataFrame in place, meaning that it does not need to be re-assigned.
Sorting a Pandas DataFrame In Place
In all of the above examples, you have learned to re-assign the resulting DataFrame. Pandas, however, also provides you with the option to sort the data in place. What this means is that the original DataFrame is modified directly, without needing to create a new object.
This can be done by modifying the inplace=
parameter. This parameter defaults to False
; modifying it to True
will allow the operation to occur in place. Let’s see what this looks like:
# Sorting a DataFrame in place
df.sort_values(
by='sales',
inplace=True
)
print(df.head())
# Returns:
# date gender region sales
# 251 2022-10-20 Male North-West 6084
# 346 2022-05-06 Female North-West 6403
# 489 2022-07-01 Female North-West 6824
# 463 2022-09-12 Female North-West 7432
# 890 2022-10-02 Male North-West 7633
The resulting DataFrame above has been modified in place, meaning that we didn’t need to re-assign it.
Exercises
It’s time to check test your learning! There are a number of exercises below, for which you can find the solutions by clicking on the toggle. Try to solve them on your own first and then check your understanding by viewing the solution.
Sort your DataFrame first by ‘gender’ and then by ‘region’, both in descending order.
df.sort_values(by=['gender', 'region'], ascending=[False, False])
What would happen if you used the following code: df.sort_values(by=[‘region’, ‘gender’], ascending = [True, True, False])
A ValueError
would be raised since the two arrays are of different lengths.
Using sorting, how would you get the second-highest sales across all regions?
You could combine .sort_values()
with .iloc
:
print(
df.sort_values(
by='sales', # Sort by Sales
ascending=False # Sort in descending order
).iloc[1, 3]) # Access the second item in the fourth column
Conclusion and Recap
In this tutorial, you learned how to sort your Pandas DataFrame using the .sort_values()
method. As a quick refresher:
- The Pandas
.sort_values()
method allows you to sort a dataframe by one or by multiple columns - The default sort method is in ascending order placing missing values at the end
- You can establish different hierarchies by sorting by multiple columns
- Ignoring your index allows you to build a tidier DataFrame
- The operation can also be done in place, allowing you to sort without needing to re-assign your values
To learn more about the .sort_values()
method, check out the official documentation here.
Pingback: Creating Pivot Tables in Pandas with Python for Python and Pandas • datagy