In this tutorial, you’ll learn how to get unique values in a Pandas DataFrame, including getting unique values for a single column and across multiple columns. Knowing how to work with unique values is an important skill for data scientists and data engineers at any level.
By the end of this tutorial, you’ll have learned the following:
- How to use the Pandas .unique() method to get unique values in a Pandas DataFrame column
- How to get unique values across multiple columns
- How to count unique values and generate frequency tables for unique values
- And more
The Quick Answer: Use Pandas unique()
You can use the Pandas .unique() method to get the unique values in a Pandas DataFrame column. The values are returned in order of appearance, not sorted.
Take a look at the code block below for how this method works:
# Get Unique Values in a Pandas DataFrame Column
import pandas as pd
df = pd.DataFrame({'Education': ['Graduate','Graduate','Undergraduate','Postgraduate']})
unique_vals = df['Education'].unique()
print(unique_vals)
# Returns: ['Graduate' 'Undergraduate' 'Postgraduate']
If you’d like to learn more, read on! This guide will teach you the ins and outs of working with unique data in a Pandas DataFrame.
Real-World Applications of Unique Data
Let’s dive into some real-world applications of working with unique data and why it matters. Take a look at the sample DataFrame that we’re creating below. We’ll be using this dataset throughout the tutorial.
# Loading a Sample Dataset
import pandas as pd
dataset = {
'Education Status': ['Graduate','Graduate','Undergraduate','Postgraduate','Graduate','Undergraduate','Postgraduate','Graduate','Undergraduate','Postgraduate','Graduate','Undergraduate','Graduate','Postgraduate','Postgraduate'],
'Employment Status': ['Employed','employed','Unemployed','Employed','Employed','Unemployed','Employed','Employed','Employed','Employed','Unemployed','Employed','Employed','Employed','Employed'],
'Gender': ['F','M','M','F','M','F','M','F','M','F','M','F','M','F','F']}
df = pd.DataFrame(dataset)
print(df.head())
# Returns:
# Education Status Employment Status Gender
# 0 Graduate Employed F
# 1 Graduate employed M
# 2 Undergraduate Unemployed M
# 3 Postgraduate Employed F
# 4 Graduate Employed M
Understanding unique data within a DataFrame allows you to understand:
- The data itself, such as what data are included and what data aren’t
- Whether or not data quality issues exist. For example, we can see that the Employment Status column has two capitalizations for the word Employed. Knowing which unique values exist helps us decide whether we need to clean our data.
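As a rough illustration of the data quality point above, the sketch below uses a small, hypothetical Series (not the tutorial's full dataset) and assumes that capitalizing every value is an acceptable clean-up for this column. Comparing the unique values before and after shows the two spellings collapsing into one.

```python
import pandas as pd

# Hypothetical column with the same capitalization issue as 'Employment Status'
status = pd.Series(['Employed', 'employed', 'Unemployed', 'Employed'])

# Before cleaning: 'Employed' and 'employed' count as different values
before = status.unique().tolist()

# After cleaning: .str.capitalize() upper-cases the first letter of each value
cleaned = status.str.capitalize()
after = cleaned.unique().tolist()

print(before)
print(after)
```

Whether capitalization is the right fix depends on the data; the point is that inspecting unique values makes the inconsistency visible in the first place.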
Let’s now dive into how the Pandas .unique() method works.
Understanding the Pandas unique() Method
The unique() method in Pandas takes no parameters. It is a Series method, so you call it on a single DataFrame column, and it returns an array of the unique values present in that column.
Here’s a breakdown of how the unique() method works:
- Select the column on which unique() will be applied by specifying the column name in brackets after the DataFrame name.
- Call the unique() method without any input parameters or arguments.
- Obtain an array of unique values found in the selected column.
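The steps above can be sketched as follows. The DataFrame and its Score column are hypothetical and were chosen only because numbers make the order-of-appearance behavior easy to see.

```python
import pandas as pd

# Hypothetical DataFrame used only for this illustration
scores_df = pd.DataFrame({'Score': [3, 1, 3, 2, 1]})

# Step 1: select the column by name
score_column = scores_df['Score']

# Step 2: call unique() with no arguments
unique_scores = score_column.unique()

# Step 3: the result holds each value once, in order of first appearance
print(unique_scores.tolist())
# Returns: [3, 1, 2]
```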
Let’s take a look at the unique() function using the sample dataset we created earlier.
Get Unique Values for a Pandas DataFrame Column
In order to get the unique values in a Pandas DataFrame column, you can simply apply the .unique() method to the column. The method will return a NumPy array, in the order in which the values appear.
Let’s take a look at how we can get the unique values in the Education Status column:
# Get Unique Values for a Column in Pandas
print(df['Education Status'].unique())
# Returns:
# ['Graduate' 'Undergraduate' 'Postgraduate']
In the example above, we applied the .unique() method to the df['Education Status'] column. This returned the three unique values as a NumPy array.
Let’s explore how we can return the unique values as a list in the next section.
Get Unique Values for a Pandas Column as a List
By default, the Pandas .unique() method returns a NumPy array of the unique values. In order to return a list instead, we can apply the .tolist() method to the array to convert it to a Python list.
Let’s see what this looks like:
# Get Unique Values for a Column in Pandas as a List
print(df['Education Status'].unique().tolist())
# Returns:
# ['Graduate', 'Undergraduate', 'Postgraduate']
In the example above, we applied the .tolist() method to our NumPy array, converting it to a list.
Let’s now take a look at how we can get unique values for multiple Pandas DataFrame columns.
Get Unique Values for Multiple Pandas DataFrame Columns
By default, the Pandas .unique() method can only be applied to a single column. This is because the method is a Pandas Series method, rather than a DataFrame method.
In order to get the unique values of multiple DataFrame columns, we can use the .drop_duplicates() method. This will return a DataFrame of all of the unique combinations.
Let’s take a look at what this looks like:
# Get Unique Values for Multiple DataFrame Columns
unique = df[['Education Status', 'Gender']].drop_duplicates()
print(unique)
# Returns:
# Education Status Gender
# 0 Graduate F
# 1 Graduate M
# 2 Undergraduate M
# 3 Postgraduate F
# 5 Undergraduate F
# 6 Postgraduate M
The Pandas .drop_duplicates() method can be a helpful way to identify only the unique combinations of values across two or more columns.
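If you instead want a single pooled set of distinct values drawn from several columns, rather than unique row combinations, one possible approach is to flatten the columns and pass the result to pd.unique(). The sketch below is one way to do this, not the only one, and the two skill columns are hypothetical.

```python
import pandas as pd

# Hypothetical DataFrame where two columns hold the same kind of value
skills = pd.DataFrame({
    'Primary Skill': ['Python', 'SQL', 'Python'],
    'Secondary Skill': ['SQL', 'Excel', 'R'],
})

# Flatten both columns into one 1-D array (row by row), then deduplicate
flat_values = skills[['Primary Skill', 'Secondary Skill']].to_numpy().ravel()
pooled = list(pd.unique(flat_values))

print(pooled)
# Returns: ['Python', 'SQL', 'Excel', 'R']
```

Because the array is flattened row by row, the order reflects first appearance when reading across each row in turn.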
Count Unique Values in a Pandas DataFrame Column
In order to count how many unique values exist in a given DataFrame column (or columns), we can apply the .nunique() method. The method will return a single value if applied to a single column, and a Pandas Series if applied to multiple columns.
Let’s see how we can use the .nunique() method to count how many unique values exist in a column:
# Count Unique Values in a Pandas DataFrame Column
num_statuses = df['Employment Status'].nunique()
print(num_statuses)
# Returns: 3
The .nunique() method is a quick way to find out how many unique values exist in a column.
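To illustrate the multi-column case mentioned above, here is a minimal sketch using a small, hypothetical sample rather than the full dataset. Applying .nunique() to a selection of columns returns a Series indexed by column name.

```python
import pandas as pd

# Small hypothetical sample with two columns
sample = pd.DataFrame({
    'Education Status': ['Graduate', 'Undergraduate', 'Postgraduate', 'Graduate'],
    'Gender': ['F', 'M', 'M', 'F'],
})

# One count per selected column, returned as a Series
counts = sample[['Education Status', 'Gender']].nunique()

print(counts['Education Status'])  # 3
print(counts['Gender'])            # 2
```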
Count Occurrences of Unique Values in a Pandas DataFrame Column
In this section, we’ll explore how to count the occurrences of values across unique values. This, in essence, generates a frequency table for the unique values in a DataFrame column.
Let’s see how we can use the .value_counts() method to count occurrences of unique values in a Pandas DataFrame column:
# Count Occurrences of Unique Values in a Pandas DataFrame Column
print(df['Education Status'].value_counts())
# Returns:
# Graduate 6
# Postgraduate 5
# Undergraduate 4
# Name: Education Status, dtype: int64
When we applied the .value_counts() method to our DataFrame column, it returned a Series in which each unique value is counted.
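If relative frequencies are more useful than raw counts, .value_counts() also accepts normalize=True. The sketch below uses a short, hypothetical Series so that the proportions are easy to verify by hand.

```python
import pandas as pd

# Hypothetical four-row column: two 'Graduate', one each of the others
education = pd.Series(['Graduate', 'Graduate', 'Undergraduate', 'Postgraduate'])

# normalize=True returns each value's share of the total instead of its count
shares = education.value_counts(normalize=True)

print(shares['Graduate'])       # 0.5
print(shares['Undergraduate'])  # 0.25
```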
Frequently Asked Questions
The unique() method is a Pandas method that is used to find the unique values in a Series object. It can be applied to a specific DataFrame column to return an array of unique values present in that column.
By default, the unique() method includes NaN values in its output array. In order to exclude missing values, you can first apply the .dropna() method to the column.
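A minimal sketch of this behavior, using a hypothetical Series with one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical column containing a missing value
col = pd.Series(['Graduate', np.nan, 'Undergraduate', 'Graduate'])

# unique() keeps the missing value as one of its results
with_missing = col.unique()

# Dropping missing values first leaves only the real categories
without_missing = col.dropna().unique().tolist()

print(len(with_missing))   # 3
print(without_missing)     # ['Graduate', 'Undergraduate']
```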
After using the unique() method to obtain the unique values in a DataFrame column, you can sort the resulting array by employing Python’s built-in sorted() function. This function accepts a sequence (such as the array returned by unique()) and returns a sorted list of elements.
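For example, with a hypothetical Series of education levels, sorting might look like this:

```python
import pandas as pd

# Hypothetical column; values appear in an arbitrary order
education = pd.Series(['Graduate', 'Undergraduate', 'Postgraduate', 'Graduate'])

# sorted() takes the unique values and returns a new, alphabetically sorted list
sorted_vals = sorted(education.unique())

print(sorted_vals)
# Returns: ['Graduate', 'Postgraduate', 'Undergraduate']
```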
To find the total number of unique values in a DataFrame column, use the nunique() method. It is applied the same way as unique() but returns an integer count of distinct values rather than an array of the unique values.
Conclusion
In this tutorial, you learned how to get unique values in a Pandas DataFrame. You first learned how to get the unique values for a single column, as well as the unique combinations across multiple columns. Then, you learned how to count unique values, as well as the occurrences of each unique value. To learn more about the .unique() method, check out the official documentation.