In this tutorial, you’ll learn how to transform your Pandas DataFrame columns using vectorized functions and custom functions using the map and apply methods. By the end of this tutorial, you’ll have a strong understanding of how Pandas applies vectorized functions and how these are optimized for performance. You’ll also learn how to use custom functions to transform and manipulate your data using the .map() and the .apply() methods.
Mapping is a term that comes from mathematics. It refers to taking a function that accepts one set of values and maps them to another set of values. This is also a common exercise you’ll need to take on in your data science journey: creating new representations of your data or transforming data into a new format. Pandas provides a number of different ways to accomplish this, allowing you to work with vectorized functions, the .map() method, and the .apply() method.
Loading a Sample Pandas DataFrame
To follow along with this tutorial, copy the code provided below to load a sample Pandas DataFrame. The dataset provides a number of helpful columns, allowing us to manipulate and transform our data in different ways.
# Loading a Sample Pandas DataFrame
import pandas as pd
df = pd.DataFrame({
    'name': ['James', 'Jane', 'Melissa', 'Ed', 'Neil'],
    'age': [30, 40, 32, 67, 43],
    'score': ['90%', '95%', '100%', '82%', '87%'],
    'age_missing_data': [30, 40, 32, 67, None],
    'income': [100000, 80000, 55000, 62000, 120000]
})
print(df)
# Returns:
# name age score age_missing_data income
# 0 James 30 90% 30.0 100000
# 1 Jane 40 95% 40.0 80000
# 2 Melissa 32 100% 32.0 55000
# 3 Ed 67 82% 67.0 62000
# 4 Neil 43 87% NaN 120000
The code above loads a DataFrame, df, with five columns: name and score are both string types, age and income are both integers, and age_missing_data is a floating-point column that includes a missing value. The dataset is deliberately small so that you can better visualize what’s going on. Let’s get started!
Understanding Vectorized Functions in Pandas
While reading through Pandas documentation, you might encounter the term “vectorized”. In many cases, this will refer to functions or methods that are built into the library and are, therefore, optimized for speed and efficiency. Vectorized operations work by applying the same instruction to an entire column of data at once, rather than to one value at a time.
Why is this faster? Imagine a for loop: in each iteration, an action is repeated, and the loop moves on to the next iteration only once that action is complete. Vectorization lets us bypass this and apply a function or transformation to many values in a single step. Under the hood, the work is handed off to optimized, compiled code that operates on the whole array, which avoids the overhead of running a Python-level loop.
In fact, you’ve likely been using vectorized expressions, perhaps without even knowing it! When you apply, say, .mean() to a Pandas column, you’re applying a vectorized method. Let’s visualize how we could do this both with a for loop and with a vectorized function.
# Visualizing the Difference Between Vectorization and Scalar Operations
# Scalar Operations (Simplified using a for loop)
length = 0
age_sum = 0
for item in df['age']:
    length += 1
    age_sum += item
average_age_for_loop = age_sum / length
# Vectorized Implementation
average_age_vectorized = df['age'].mean()
Of course, the for loop method is significantly simplified compared to other methods you’ll learn below, but it brings the point home! There are also significant performance differences between these two implementations.
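To make the performance gap concrete, here is a minimal sketch that times both approaches with Python’s built-in timeit module. The Series of 100,000 random ages (ages) and the helper loop_mean are assumptions made for illustration and are not part of the sample DataFrame. Exact timings will vary by machine, so no expected output is shown.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger column so the difference is measurable
rng = np.random.default_rng(seed=0)
ages = pd.Series(rng.integers(18, 90, size=100_000))

def loop_mean(series):
    """Compute the mean one element at a time, as in the for loop above."""
    total = 0
    count = 0
    for item in series:
        total += item
        count += 1
    return total / count

# Run each approach five times and report the total elapsed time
loop_time = timeit.timeit(lambda: loop_mean(ages), number=5)
vectorized_time = timeit.timeit(lambda: ages.mean(), number=5)

print(f"for loop:   {loop_time:.4f} seconds")
print(f"vectorized: {vectorized_time:.4f} seconds")
```

Both approaches return the same average; only the time taken to compute it differs.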
Using the Pandas map Method
The Pandas .map() method can be applied to a Pandas Series, meaning it can be applied to a Pandas DataFrame column. The method is interesting because it behaves in three different ways, depending on what you pass into it. Let’s take a look at the types of objects that can be passed in:
- Dictionaries: Pandas will use the .map() method to map items pair-wise, based on a key:value pair
- Functions: Pandas will apply the function element-wise, evaluating it against each value in the Series
- Series: Pandas will look up each value in the index of the Series that’s passed in and return the corresponding value
In the following sections, you’ll dive deeper into each of these scenarios to see how the .map() method can be used to transform and map a Pandas column.
Using the Pandas map Method to Map a Dictionary
When you pass a dictionary into the Pandas .map() method, it will map in the values from the corresponding keys in the dictionary. This works much like the VLOOKUP function in Excel and can be a helpful way to transform data.
For example, we could map in the gender of each person in our DataFrame by using the .map() method. Let’s define a dictionary where the keys are the people’s names and the values are their corresponding genders.
# Creating a dictionary of genders
genders = {'James': 'Male', 'Jane': 'Female', 'Melissa': 'Female', 'Ed': 'Male', 'Neil': 'Male'}
Now that we have our dictionary defined, we can apply the method to the name column and pass in our dictionary, as shown below:
# Applying a dictionary to the map method
df['gender'] = df['name'].map(genders)
print(df)
# Returns:
# name age score age_missing_data income gender
# 0 James 30 90% 30.0 100000 Male
# 1 Jane 40 95% 40.0 80000 Female
# 2 Melissa 32 100% 32.0 55000 Female
# 3 Ed 67 82% 67.0 62000 Male
# 4 Neil 43 87% NaN 120000 Male
The Pandas .map() method works similarly to how you’d look up a value in another table using the Excel VLOOKUP function.
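One detail worth knowing: when a value in the column has no matching key in the dictionary, .map() returns NaN for that row rather than raising an error. The sketch below illustrates this with a deliberately incomplete dictionary (partial_genders, a name invented for this example).

```python
import pandas as pd

names = pd.Series(['James', 'Jane', 'Melissa', 'Ed', 'Neil'])

# Hypothetical dictionary that is missing an entry for 'Neil'
partial_genders = {'James': 'Male', 'Jane': 'Female', 'Melissa': 'Female', 'Ed': 'Male'}

# 'Neil' has no key in the dictionary, so his row becomes NaN
mapped = names.map(partial_genders)
print(mapped)

# Fill the unmatched rows with a placeholder if NaN is not wanted
mapped_filled = mapped.fillna('Unknown')
print(mapped_filled)
```

Checking for missing values after a dictionary mapping is a quick way to spot keys that were left out.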
Using the Pandas map Method to Map a Function
In this example, you’ll learn how to map a function onto a Pandas column. The function we pass in expects a single value from the Series and returns a transformed version of that value. In this case, the .map() method will return a completely new Series.
Let’s design a function that evaluates whether each person’s income is higher or lower than the average income. We’ll then apply that function using the .map() method:
# Mapping in a custom function
mean_income = df['income'].mean()
def higher_income(x):
    return x > mean_income
df['higher_than_avg_income'] = df['income'].map(higher_income)
print(df)
# Returns:
# name age score age_missing_data income higher_than_avg_income
# 0 James 30 90% 30.0 100000 True
# 1 Jane 40 95% 40.0 80000 False
# 2 Melissa 32 100% 32.0 55000 False
# 3 Ed 67 82% 67.0 62000 False
# 4 Neil 43 87% NaN 120000 True
Let’s break down what we did here:
- We calculated the average income and assigned it to the variable mean_income
- We then defined a function that takes a single input and evaluates whether that input is greater than the mean value
- Finally, the function is mapped onto the income column and used to generate a new DataFrame column
It may seem overkill to define a function only to use it a single time. Because of this, we can define an anonymous function. This is what you’ll learn in the following section.
Using the Pandas map Method to Map an Anonymous Lambda Function
Python allows us to define anonymous functions, called lambda functions, which are functions defined without a name. This can be helpful when we need to use a function only a single time and want to keep the code compact. Let’s see how we can replicate the example above with the use of a lambda function:
# Mapping in an Anonymous Function
mean_income = df['income'].mean()
df['higher_than_avg_income'] = df['income'].map(lambda x: x > mean_income)
print(df)
# Returns:
# name age score age_missing_data income higher_than_avg_income
# 0 James 30 90% 30.0 100000 True
# 1 Jane 40 95% 40.0 80000 False
# 2 Melissa 32 100% 32.0 55000 False
# 3 Ed 67 82% 67.0 62000 False
# 4 Neil 43 87% NaN 120000 True
This process is a little cleaner for whoever may be reading your code. It makes it clear that the function exists only for the purpose of this single use.
Using the Pandas map Method to Map an Indexed Series
In this final example, you’ll learn how to pass a Pandas Series into the .map() method. Pandas looks up each value of the Series to which the method is applied in the index of the Series that’s passed in, and returns the matching value. This behaves much like the dictionary example, with the index acting as the keys. Let’s take a look at how this could work:
# Mapping in a Series
last_names = pd.Series(['Doe', 'Miller', 'Edwards', 'Nelson', 'Raul'], index=df['name'])
df['Last Name'] = df['name'].map(last_names)
print(df)
# Returns:
# name age score age_missing_data income Last Name
# 0 James 30 90% 30.0 100000 Doe
# 1 Jane 40 95% 40.0 80000 Miller
# 2 Melissa 32 100% 32.0 55000 Edwards
# 3 Ed 67 82% 67.0 62000 Nelson
# 4 Neil 43 87% NaN 120000 Raul
Let’s take a look at what we did here: we created a Pandas Series using a list of last names, passing in the 'name' column from our DataFrame as its index. The .map() method then completed a one-to-one match between the values in the name column and the index of that Series.
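Because the match is made on the index of the Series that’s passed in, the order of that Series doesn’t matter. The following sketch uses a shuffled lookup Series (lookup, a name made up for this example) to show that each name still receives the right last name.

```python
import pandas as pd

names = pd.Series(['James', 'Jane', 'Melissa'])

# Lookup Series in a different order than `names`; the index holds the keys
lookup = pd.Series(
    ['Edwards', 'Doe', 'Miller'],
    index=['Melissa', 'James', 'Jane']
)

# Each name is matched against the index of `lookup`, not its position
last_names = names.map(lookup)
print(last_names.tolist())
```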
Using the Pandas apply Method
Pandas also provides another method to map in a function, the .apply() method. This method is different in a number of important ways:
- The .apply() method can be applied to either a Pandas Series or a Pandas DataFrame. The .map() method has traditionally been exclusive to a Pandas Series (pandas 2.1 added an element-wise DataFrame.map() to replace .applymap()).
- The .apply() method can only take a callable (i.e., a function)
- It can be used to aggregate data, rather than simply mapping a transformation
Now that you know some of the key differences between the two methods, let’s dive into how to map a function into a Pandas DataFrame.
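Before moving on, here is a brief sketch of the aggregation point from the list above. It applies a function to each column of a DataFrame (the default axis=0), so that every column is reduced to a single value. The helper name value_range is an assumption for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'age': [30, 40, 32, 67, 43],
    'income': [100000, 80000, 55000, 62000, 120000]
})

def value_range(column):
    """Return the spread between the largest and smallest value."""
    return column.max() - column.min()

# With the default axis=0, each column is passed to the function as a Series
ranges = df.apply(value_range)
print(ranges)
```

The result is a Series with one entry per column, which is something .map() cannot produce since it always returns one value per element.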
Using the Pandas apply Method to Apply a Function
The Pandas .apply() method allows us to pass in a function that evaluates against either a Series or an entire DataFrame. Because of this, let’s take a look at an example where we evaluate against more than a single Series (a single Series being something we could already handle with .map()). Let’s create a column that takes into account both the age and income columns. If a person is under 45 and makes more than 75,000, we’ll call them for an interview:
# Applying a function to an entire dataframe
def interview(row):
    return row['age'] < 45 and row['income'] > 75000
df['interview'] = df.apply(interview, axis=1)
print(df)
# Returns:
# name age score age_missing_data income interview
# 0 James 30 90% 30.0 100000 True
# 1 Jane 40 95% 40.0 80000 True
# 2 Melissa 32 100% 32.0 55000 False
# 3 Ed 67 82% 67.0 62000 False
# 4 Neil 43 87% NaN 120000 True
We can see that we’re able to apply a function that takes into account more than one column! This can open up some significant potential.
Passing in Arguments with Pandas apply
One of the less intuitive ways we can use the .apply() method is by passing in arguments. Because we pass in only the callable (i.e., the function name without parentheses), there’s no intuitive way of passing in arguments. Let’s define a function where we may want to modify its behavior by making use of arguments:
# Passing in arguments into an .apply method
def bonus(row, amount, give=False):
    if give:
        return row['income'] / row['age'] * amount
    else:
        return 0
df['bonus'] = df.apply(bonus, args=(0.25,), give=True, axis=1)
print(df)
# Returns:
# name age score age_missing_data income bonus
# 0 James 30 90% 30.0 100000 833.333333
# 1 Jane 40 95% 40.0 80000 500.000000
# 2 Melissa 32 100% 32.0 55000 429.687500
# 3 Ed 67 82% 67.0 62000 231.343284
# 4 Neil 43 87% NaN 120000 697.674419
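The benefit of this approach is that we can define the function once and modify its behavior depending on the arguments we pass in. Positional arguments are supplied through the args tuple, while keyword arguments such as give are passed directly to .apply(), which forwards them to the function. In the example above, this lets us choose whether or not to give a bonus.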
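An alternative that some readers find easier to follow is to wrap the call in a lambda, so that the arguments appear where the function is actually called. The sketch below repeats the bonus example under that approach. It is one option, not a replacement for passing args and keyword arguments.

```python
import pandas as pd

df = pd.DataFrame({
    'age': [30, 40, 32, 67, 43],
    'income': [100000, 80000, 55000, 62000, 120000]
})

def bonus(row, amount, give=False):
    if give:
        return row['income'] / row['age'] * amount
    return 0

# Style used above: extra arguments handed to .apply()
with_args = df.apply(bonus, args=(0.25,), give=True, axis=1)

# Alternative: a lambda that calls bonus() with the arguments spelled out
with_lambda = df.apply(lambda row: bonus(row, 0.25, give=True), axis=1)

# Both styles should produce identical results
print(with_args.equals(with_lambda))
```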
Performance Implications of Pandas map and apply
If you’ve been following along with the examples, you might have noticed that all the examples ran in roughly the same amount of time. That’s in large part because the dataset we used was so small. If you were to try some of these methods on larger datasets, you may run into performance problems.
This is because, like our for-loop example earlier, these methods iterate over each row of the DataFrame. It’s important to try and optimize your code for speed, especially when working with larger datasets. Because of this, it’s often better to try and find a built-in Pandas function, rather than applying your own.
For example, we could convert an earlier .map() example to a more native approach. Let’s convert whether a person’s income is higher than the average income by using a built-in vectorized format:
# Old Format
mean_income = df['income'].mean()
df['higher_than_avg_income'] = df['income'].map(lambda x: x > mean_income)
# Vectorized Format
df['higher_than_avg_income'] = df['income'] > mean_income
Performance may not seem like a big deal when starting out, but each step we take to modify our data will add time to our overall work. When working with significantly larger datasets, it’s important to keep performance in mind. It can often help to start with one process and then try different, faster ways to achieve the same end.
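The same idea applies to the .apply() examples. As a sketch, the interview and bonus columns from earlier can be built with element-wise operators instead of a row-wise function. Note that combining conditions on whole columns requires the & operator with parentheses around each condition, rather than Python’s and keyword. The column name interview_apply is made up here so that the two versions can be compared side by side.

```python
import pandas as pd

df = pd.DataFrame({
    'age': [30, 40, 32, 67, 43],
    'income': [100000, 80000, 55000, 62000, 120000]
})

# Row-wise version from earlier, kept for comparison
df['interview_apply'] = df.apply(
    lambda row: row['age'] < 45 and row['income'] > 75000, axis=1
)

# Vectorized version: each comparison produces a boolean Series
df['interview'] = (df['age'] < 45) & (df['income'] > 75000)

# Vectorized bonus: arithmetic operators work on whole columns
df['bonus'] = df['income'] / df['age'] * 0.25

print(df[['interview_apply', 'interview', 'bonus']])
```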
Exercises
It’s time to test your learning. Try to complete the exercises below. A sample solution follows each exercise:
Create a column that converts the string score column to an integer percentage.
df['percent'] = df['score'].map(lambda x: int(x.replace('%', '')))
print(df)
# Returns:
# name age score age_missing_data income percent
# 0 James 30 90% 30.0 100000 90
# 1 Jane 40 95% 40.0 80000 95
# 2 Melissa 32 100% 32.0 55000 100
# 3 Ed 67 82% 67.0 62000 82
# 4 Neil 43 87% NaN 120000 87
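The same conversion can also be sketched with the vectorized string methods available under the .str accessor, which avoids calling a Python function for each row. The column name percent_vectorized is made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'score': ['90%', '95%', '100%', '82%', '87%']})

# Strip the trailing percent sign from every value, then convert to integers
df['percent_vectorized'] = df['score'].str.rstrip('%').astype(int)
print(df)
```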
Convert this into a vectorized format: df['perc_of_total'] = df['income'].map(lambda x: x / df['income'].sum())
total_income = df['income'].sum()
df['perc_of_total'] = df['income'] / total_income
print(df)
# Returns:
# name age score age_missing_data income perc_of_total
# 0 James 30 90% 30.0 100000 0.239808
# 1 Jane 40 95% 40.0 80000 0.191847
# 2 Melissa 32 100% 32.0 55000 0.131894
# 3 Ed 67 82% 67.0 62000 0.148681
# 4 Neil 43 87% NaN 120000 0.287770
Conclusion and Recap
In this tutorial, you learned how to analyze and transform your Pandas DataFrame using vectorized functions, and the .map() and .apply() methods. The section below provides a recap of everything you’ve learned:
- Pandas provides a wide array of solutions to modify your DataFrame columns
- Vectorized, built-in functions apply an operation to multiple records at once, rather than one value at a time
- The Pandas .map() method can take a dictionary to map values based on the dictionary’s keys
- The Pandas .map() method can take a Series to map values based on that Series’ index
- The Pandas .map() method can take a function to apply to a single column
- The Pandas .apply() method can pass a function to either a single column or an entire DataFrame
- .map() and .apply() have performance considerations beyond built-in vectorized functions. Be careful with performance hogs!