
Transforming Pandas Columns with map and apply


In this tutorial, you’ll learn how to transform your Pandas DataFrame columns with vectorized functions and with custom functions passed to the map and apply methods. By the end of this tutorial, you’ll have a strong understanding of how Pandas applies vectorized functions and how these are optimized for performance. You’ll also learn how to use custom functions to transform and manipulate your data using the .map() and the .apply() methods.

Mapping is a term that comes from mathematics. It refers to taking a function that accepts one set of values and maps them to another set of values. This is also a common exercise you’ll need to take on in your data science journey: creating new representations of your data or transforming data into a new format. Pandas provides a number of different ways to accomplish this, allowing you to work with vectorized functions, the .map() method, and the .apply() method.

Loading a Sample Pandas DataFrame

To follow along with this tutorial, copy the code provided below to load a sample Pandas DataFrame. The dataset provides a number of helpful columns, allowing us to manipulate and transform our data in different ways.

# Loading a Sample Pandas DataFrame
import pandas as pd
df = pd.DataFrame({
    'name': ['James', 'Jane', 'Melissa', 'Ed', 'Neil'],
    'age': [30, 40, 32, 67, 43],
    'score': ['90%', '95%', '100%', '82%', '87%'],
    'age_missing_data': [30, 40, 32, 67, None],
    'income':[100000, 80000, 55000, 62000, 120000]
})
print(df)

# Returns:
#       name  age score  age_missing_data  income
# 0    James    30   90%               30.0  100000
# 1     Jane    40   95%               40.0   80000
# 2  Melissa    32  100%               32.0   55000
# 3       Ed    67   82%               67.0   62000
# 4     Neil    43   87%                NaN  120000

The code above loads a DataFrame, df, with five columns: name and score are both string types, age and income are both integers, and age_missing_data is a floating-point value with a missing value included. The dataset is deliberately small so that you can better visualize what’s going on. Let’s get started!
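If you’d like to confirm these column types yourself, a quick check along the following lines can help. This is a minimal sketch that rebuilds the same sample DataFrame; the variable names age_is_int and age_missing_is_float are introduced here for illustration, and the label printed for the string columns may differ between Pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['James', 'Jane', 'Melissa', 'Ed', 'Neil'],
    'age': [30, 40, 32, 67, 43],
    'score': ['90%', '95%', '100%', '82%', '87%'],
    'age_missing_data': [30, 40, 32, 67, None],
    'income': [100000, 80000, 55000, 62000, 120000]
})

# Inspect the data type Pandas inferred for each column
print(df.dtypes)

# The missing value forces age_missing_data to be stored as floats
age_is_int = pd.api.types.is_integer_dtype(df['age'])
age_missing_is_float = pd.api.types.is_float_dtype(df['age_missing_data'])
print(age_is_int, age_missing_is_float)
```

The None in age_missing_data is stored as NaN, which is a floating-point value. That is why the column prints as 30.0, 40.0, and so on, rather than as whole numbers.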

Understanding Vectorized Functions in Pandas

While reading through Pandas documentation, you might encounter the term “vectorized”. In many cases, this refers to functions or methods that are built into the library and are, therefore, optimized for speed and efficiency. These operations work by applying the same set of instructions to an entire column of data at once, rather than to one value at a time.

Why is this faster? Imagine a for loop: in each iteration, an action is repeated, and the loop moves on to the next iteration only once that action is complete. Each of those steps runs through the Python interpreter, which adds overhead. Vectorization lets us bypass this and apply a function or transformation to a whole array of values in a single call, with the looping handled by optimized, compiled code under the hood.

In fact, you’ve likely been using vectorized expressions, perhaps, without even knowing it! When you apply, say, .mean() to a Pandas column, you’re applying a vectorized method. Let’s visualize how we could do this both with a for loop and with a vectorized function.

# Visualizing the Difference Between Vectorization and Scalar Operations
# Scalar Operations (Simplified using a for loop)
length = 0
age_sum = 0
for item in df['age']:
    length += 1
    age_sum += item

average_age_for_loop = age_sum / length

# Vectorized Implementation
average_age_vectorized = df['age'].mean()

Of course, the for loop method is significantly simplified compared to other methods you’ll learn below, but it brings the point home! There are also significant performance differences between these two implementations.
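To get a rough sense of the performance gap, the sketch below times both approaches on a larger, randomly generated column. The ages Series and the mean_with_loop helper are assumptions introduced for illustration, not part of the tutorial’s dataset. Timings vary by machine, so no specific numbers are shown.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger column: 200,000 random ages (not the tutorial's data)
rng = np.random.default_rng(seed=0)
ages = pd.Series(rng.integers(18, 90, size=200_000))

def mean_with_loop(series):
    # Python-level loop: one interpreter step per value
    total = 0
    count = 0
    for item in series:
        count += 1
        total += item
    return total / count

loop_result = mean_with_loop(ages)
vectorized_result = ages.mean()

# Run each approach three times and report the total elapsed time
loop_seconds = timeit.timeit(lambda: mean_with_loop(ages), number=3)
vectorized_seconds = timeit.timeit(lambda: ages.mean(), number=3)

print(f'for loop:   {loop_seconds:.4f} seconds')
print(f'vectorized: {vectorized_seconds:.4f} seconds')
```

Both approaches produce the same average, so the only thing that changes is how long you wait for it.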

Using the Pandas map Method

The Pandas .map() method can be applied to a Pandas Series, meaning it can be applied to a Pandas DataFrame column. The map method is interesting because it can take three different shapes, depending on what you pass into it. Let’s take a look at the types of objects that can be passed in:

  1. Dictionaries: Pandas will look up each value in the Series as a key in the dictionary and return the corresponding value from the key:value pair
  2. Functions: Pandas will apply the function element-wise, calling it once for each value in the Series
  3. Series: Pandas will look up each value in the Series to which the method is applied in the index of the Series that’s passed in, and return the matching value

In the following sections, you’ll dive deeper into each of these scenarios to see how the .map() method can be used to transform and map a Pandas column.

Using the Pandas map Method to Map a Dictionary

When you pass a dictionary into the Pandas .map() method, it treats each value in the Series as a key and maps in the corresponding value from the dictionary. This works much like the VLOOKUP function in Excel and can be a helpful way to transform data.

For example, we could map in the gender of each person in our DataFrame by using the .map() method. Let’s define a dictionary where the keys are the people and their corresponding gender are the keys’ values.

# Creating a dictionary of genders
genders = {'James': 'Male', 'Jane': 'Female', 'Melissa': 'Female', 'Ed': 'Male', 'Neil': 'Male'}

Now that we have our dictionary defined, we can apply the method to the name column and pass in our dictionary, as shown below:

# Applying a dictionary to the map method
df['gender'] = df['name'].map(genders)
print(df)

# Returns:
#       name  age score  age_missing_data  income  gender
# 0    James   30   90%              30.0  100000    Male
# 1     Jane   40   95%              40.0   80000  Female
# 2  Melissa   32  100%              32.0   55000  Female
# 3       Ed   67   82%              67.0   62000    Male
# 4     Neil   43   87%               NaN  120000    Male

The Pandas .map() method works similarly to how you’d look up a value in another table using the Excel VLOOKUP function.
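One behavior worth knowing about is what happens when a value has no matching key in the dictionary. The sketch below uses a deliberately incomplete dictionary, partial_genders, which is a hypothetical variation on the one above. Values without a match come back as missing (NaN), which you can then fill with a placeholder of your choosing.

```python
import pandas as pd

names = pd.Series(['James', 'Jane', 'Melissa', 'Ed', 'Neil'])

# Hypothetical partial lookup table: 'Ed' and 'Neil' are deliberately left out
partial_genders = {'James': 'Male', 'Jane': 'Female', 'Melissa': 'Female'}

# Names without a matching key are mapped to a missing value
mapped = names.map(partial_genders)
print(mapped)

# Replace the unmatched entries with a placeholder
mapped_filled = mapped.fillna('Unknown')
print(mapped_filled)
```

This mirrors how a lookup in a spreadsheet fails when the key isn’t present in the lookup table, so it’s worth checking for missing values after mapping.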

Using the Pandas map Method to Map a Function

In this example, you’ll learn how to map a function onto a Pandas column. The function we pass in should accept a single value from the Series and return a transformed version of that value. The .map() method then returns a completely new Series.

Let’s design a function that evaluates whether each person’s income is higher or lower than the average income. We’ll then apply that function using the .map() method:

# Mapping in a custom function
mean_income = df['income'].mean()

def higher_income(x):
    return x > mean_income

df['higher_than_avg_income'] = df['income'].map(higher_income)
print(df)

# Returns:
#       name  age score  age_missing_data  income  higher_than_avg_income
# 0    James   30   90%              30.0  100000                    True
# 1     Jane   40   95%              40.0   80000                   False
# 2  Melissa   32  100%              32.0   55000                   False
# 3       Ed   67   82%              67.0   62000                   False
# 4     Neil   43   87%               NaN  120000                    True

Let’s break down what we did here:

  1. We calculated the average income and assigned it to the variable mean_income
  2. We then defined a function that takes a single input and evaluates whether that input is greater than the mean value
  3. Finally, the function is mapped onto the income column and used to generate a new DataFrame column

It may seem overkill to define a function only to use it a single time. Because of this, we can define an anonymous function. This is what you’ll learn in the following section.

Using the Pandas map Method to Map an Anonymous Lambda Function

Python allows us to define anonymous functions, lambda functions, which are functions that are defined without a name. This can be helpful when we need to use a function only a single time and want to simplify the use of the function. Let’s see how we can replicate the example above with the use of a lambda function:

# Mapping in an Anonymous Function
mean_income = df['income'].mean()
df['higher_than_avg_income'] = df['income'].map(lambda x: x > mean_income)
print(df)

# Returns:
#       name  age score  age_missing_data  income  higher_than_avg_income
# 0    James   30   90%              30.0  100000                    True
# 1     Jane   40   95%              40.0   80000                   False
# 2  Melissa   32  100%              32.0   55000                   False
# 3       Ed   67   82%              67.0   62000                   False
# 4     Neil   43   87%               NaN  120000                    True

This process is a little cleaner for whoever may be reading your code. It makes it clear that the function exists only for the purpose of this single use.
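Something the examples above don’t run into is missing data. By default, .map() also calls your function on missing values, which can raise an error if the function doesn’t expect them. The sketch below, which assumes a standalone copy of the age_missing_data column, uses the na_action='ignore' argument of .map() to pass missing values through untouched. The age_labels name is introduced here for illustration.

```python
import pandas as pd

ages = pd.Series([30, 40, 32, 67, None], name='age_missing_data')

# int(x) would fail on a missing value, so we ask .map() to skip those:
# with na_action='ignore', missing values are left as-is in the result
age_labels = ages.map(lambda x: f'{int(x)} years', na_action='ignore')
print(age_labels)
```

The first four values become labels such as '30 years', while the final, missing value stays missing.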

Using the Pandas map Method to Map an Indexed Series

In this final example, you’ll learn how to pass a Pandas Series into the .map() method. This works much like passing in a dictionary: each value in the Series to which the method is applied is looked up in the index of the Series that’s passed in, and the matching value is returned. Let’s take a look at how this could work:

# Mapping in a Series
last_names = pd.Series(['Doe', 'Miller', 'Edwards', 'Nelson', 'Raul'], index=df['name'])
df['Last Name'] = df['name'].map(last_names)

print(df)

# Returns:
#       name  age score  age_missing_data  income Last Name
# 0    James   30   90%              30.0  100000       Doe
# 1     Jane   40   95%              40.0   80000    Miller
# 2  Melissa   32  100%              32.0   55000   Edwards
# 3       Ed   67   82%              67.0   62000    Nelson
# 4     Neil   43   87%               NaN  120000      Raul

Let’s take a look at what we did here: we created a Pandas Series from a list of last names, using the 'name' column of our DataFrame as its index. The .map() method then matched each value in the name column against that index, one-to-one, and returned the corresponding last name.
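To make it clear that the matching is done on index labels rather than on position, here is a small sketch in which the lookup Series is listed in the reverse order. It assumes a standalone names Series rather than the full DataFrame.

```python
import pandas as pd

names = pd.Series(['James', 'Jane', 'Melissa', 'Ed', 'Neil'])

# Lookup Series listed in a different order than `names`;
# its index holds the keys that .map() matches against
last_names = pd.Series(
    ['Raul', 'Nelson', 'Edwards', 'Miller', 'Doe'],
    index=['Neil', 'Ed', 'Melissa', 'Jane', 'James']
)

mapped = names.map(last_names)
print(mapped.tolist())
# ['Doe', 'Miller', 'Edwards', 'Nelson', 'Raul']
```

Even though the lookup Series is reversed, each first name still receives the correct last name, because the match is made on the index label.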

Using the Pandas apply Method

Pandas also provides another method to map in a function, the .apply() method. This method is different in a number of important ways:

  1. The .apply() method can be applied to either a Pandas Series or a Pandas DataFrame. The .map() method covered in this tutorial is a Series method (recent Pandas versions also provide a separate, element-wise DataFrame.map() method).
  2. The .apply() method is designed around a callable (i.e., a function), whereas .map() also accepts a dictionary or a Series
  3. It can be used to aggregate data, rather than simply mapping a transformation
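The third point can be easier to see in code. The sketch below, which assumes a trimmed-down copy of the sample data with only the numeric columns, contrasts an element-wise transformation with an aggregation. The names income_in_thousands and value_ranges are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'age': [30, 40, 32, 67, 43],
    'income': [100000, 80000, 55000, 62000, 120000]
})

# Element-wise transformation on a Series: the result has the same length
income_in_thousands = df['income'].apply(lambda x: x / 1000)

# Aggregation on a DataFrame: with the default axis=0, the function
# receives each column as a Series and returns one value per column
value_ranges = df.apply(lambda col: col.max() - col.min())

print(income_in_thousands)
print(value_ranges)
```

The first call returns five values, one per row, while the second collapses each column down to a single number.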

Now that you know some of the key differences between the two methods, let’s dive into how to map a function into a Pandas DataFrame.

Using the Pandas apply Method to Apply a Function

The Pandas .apply() method allows us to pass in a function that evaluates against either a Series or an entire DataFrame. Because a single Series could just as easily be handled with .map(), let’s take a look at an example where we evaluate against more than one column. We’ll create a column that takes into account both the age and income columns. If a person is under 45 and makes more than 75,000, we’ll call them for an interview:

# Applying a function to an entire dataframe
def interview(row):
    return row['age'] < 45 and row['income'] > 75000

df['interview'] = df.apply(interview, axis=1)
print(df)

# Returns:
#       name  age score  age_missing_data  income  interview
# 0    James   30   90%              30.0  100000       True
# 1     Jane   40   95%              40.0   80000       True
# 2  Melissa   32  100%              32.0   55000      False
# 3       Ed   67   82%              67.0   62000      False
# 4     Neil   43   87%               NaN  120000       True

We can see that we’re able to apply a function that takes into account more than one column! This can open up some significant potential.
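For a simple condition like this one, the same result can also be reached without calling a function on every row. The sketch below, which assumes a trimmed-down copy of the sample data, compares the .apply() version with a vectorized version that combines two boolean Series.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['James', 'Jane', 'Melissa', 'Ed', 'Neil'],
    'age': [30, 40, 32, 67, 43],
    'income': [100000, 80000, 55000, 62000, 120000]
})

def interview(row):
    return row['age'] < 45 and row['income'] > 75000

# Row-by-row version from the example above
applied = df.apply(interview, axis=1)

# Vectorized version: & combines the two boolean Series element-wise.
# The parentheses are required because & binds more tightly than < and >
vectorized = (df['age'] < 45) & (df['income'] > 75000)

print(vectorized.tolist())
# [True, True, False, False, True]
```

The .apply() approach remains useful when the logic is too involved to express as column-level comparisons.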

Passing in Arguments with Pandas apply

One of the less intuitive ways we can use the .apply() method is by passing in arguments. Because we pass in only the callable (i.e., the function name without parentheses), there’s no intuitive way of passing in arguments. Let’s define a function where we may want to modify its behavior by making use of arguments:

# Passing in arguments into an .apply method
def bonus(row, amount, give=False):
    if give:
        return row['income'] / row['age'] * amount
    else:
        return 0

df['bonus'] = df.apply(bonus, args = (0.25,), give = True, axis=1)
print(df)

# Returns:
#       name  age score  age_missing_data  income       bonus
# 0    James   30   90%              30.0  100000  833.333333
# 1     Jane   40   95%              40.0   80000  500.000000
# 2  Melissa   32  100%              32.0   55000  429.687500
# 3       Ed   67   82%              67.0   62000  231.343284
# 4     Neil   43   87%               NaN  120000  697.674419

The benefit of this approach is that we can define the function once. This allows us to modify the behavior depending on certain conditions being met. For example, in the example above, we can either choose to give a bonus or not.
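There is more than one way to hand extra arguments to the function. The sketch below, which assumes a trimmed-down copy of the sample data, shows three equivalent spellings: the args tuple used above, passing everything by keyword, and wrapping the call in a lambda. The result names with_args, with_kwargs, and with_lambda are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'age': [30, 40, 32, 67, 43],
    'income': [100000, 80000, 55000, 62000, 120000]
})

def bonus(row, amount, give=False):
    if give:
        return row['income'] / row['age'] * amount
    return 0

# Option 1: positional extras in the args tuple, keyword extras passed directly
with_args = df.apply(bonus, axis=1, args=(0.25,), give=True)

# Option 2: every extra argument passed by keyword
with_kwargs = df.apply(bonus, axis=1, amount=0.25, give=True)

# Option 3: a lambda that makes the full call visible inline
with_lambda = df.apply(lambda row: bonus(row, 0.25, give=True), axis=1)

print(with_args)
```

Which spelling to use is largely a matter of readability. Keyword arguments make it obvious which parameter each value belongs to.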

Performance Implications of Pandas map and apply

If you’ve been following along with the examples, you might have noticed that all the examples ran in roughly the same amount of time. That’s in large part because the dataset we used was so small. If you were to try some of these methods on larger datasets, you may run into some performance implications.

This is because, like our for-loop example earlier, these methods call a Python function once for each value (or each row) being processed. It’s important to try and optimize your code for speed, especially when working with larger datasets. Because of this, it’s often better to try and find a built-in Pandas function, rather than applying your own.

For example, we could convert an earlier .map() example to a more native approach. Let’s convert whether a person’s income is higher than the average income by using a built-in vectorized format:

# Old Format
mean_income = df['income'].mean()
df['higher_than_avg_income'] = df['income'].map(lambda x: x > mean_income)

# Vectorized Format
df['higher_than_avg_income'] = df['income'] > mean_income

Performance may not seem like a big deal when starting out, but each step we take to modify our data will add time to our overall work. When working with significantly larger datasets, it’s important to keep performance in mind. It can often help to start with one process and then try different, faster ways to achieve the same end.
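To see how this plays out on more data, the sketch below times the two formats above on a larger, randomly generated column. The incomes Series is an assumption introduced for illustration and is not part of the tutorial’s dataset. Timings vary by machine, so no specific numbers are shown.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger column: 200,000 random incomes (not the tutorial's data)
rng = np.random.default_rng(seed=0)
incomes = pd.Series(rng.integers(30_000, 150_000, size=200_000))
mean_income = incomes.mean()

# Both formats should produce the same boolean column
mapped = incomes.map(lambda x: x > mean_income)
vectorized = incomes > mean_income

# Run each approach three times and report the total elapsed time
map_seconds = timeit.timeit(
    lambda: incomes.map(lambda x: x > mean_income), number=3
)
vectorized_seconds = timeit.timeit(lambda: incomes > mean_income, number=3)

print(f'.map():     {map_seconds:.4f} seconds')
print(f'vectorized: {vectorized_seconds:.4f} seconds')
```

Running a quick comparison like this on a sample of your own data is a reasonable way to decide whether a rewrite is worth the effort.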

Exercises

It’s time to test your learning. Try and complete the exercises below. A sample solution follows each exercise:

Create a column that converts the string score column to an integer percentage.

df['percent'] = df['score'].map(lambda x: int(x.replace('%', '')))
print(df)

# Returns:
#       name  age score  age_missing_data  income  percent
# 0    James   30   90%              30.0  100000       90
# 1     Jane   40   95%              40.0   80000       95
# 2  Melissa   32  100%              32.0   55000      100
# 3       Ed   67   82%              67.0   62000       82
# 4     Neil   43   87%               NaN  120000       87
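The same conversion can also be written with Pandas’ built-in string methods instead of a lambda. The sketch below assumes a standalone copy of the score column and goes one step further by dividing by 100 to produce a ratio. The names percent and ratio are introduced here for illustration.

```python
import pandas as pd

scores = pd.Series(['90%', '95%', '100%', '82%', '87%'], name='score')

# Strip the trailing percent sign, then convert the whole column at once
percent = scores.str.rstrip('%').astype(int)

# Dividing by 100 turns the whole-number percentage into a ratio
ratio = percent / 100

print(percent.tolist())
# [90, 95, 100, 82, 87]
print(ratio.tolist())
# [0.9, 0.95, 1.0, 0.82, 0.87]
```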

Convert this into a vectorized format: df['perc_of_total'] = df['income'].map(lambda x: x / df['income'].sum())

total_income = df['income'].sum()
df['perc_of_total'] = df['income'] / total_income

print(df)

# Returns:
#       name  age score  age_missing_data  income  perc_of_total
# 0    James   30   90%              30.0  100000       0.239808
# 1     Jane   40   95%              40.0   80000       0.191847
# 2  Melissa   32  100%              32.0   55000       0.131894
# 3       Ed   67   82%              67.0   62000       0.148681
# 4     Neil   43   87%               NaN  120000       0.287770

Conclusion and Recap

In this tutorial, you learned how to analyze and transform your Pandas DataFrame using vectorized functions, and the .map() and .apply() methods. The section below provides a recap of everything you’ve learned:

  • Pandas provides a wide array of solutions to modify your DataFrame columns
  • Vectorized, built-in functions let you apply an operation to an entire column at once, rather than looping over records one at a time
  • The Pandas .map() method can accept a dictionary and map each value in a Series to the value stored under the matching key
  • The Pandas .map() method can accept a Series and map each value based on that Series’ index
  • The Pandas .map() method can accept a function and apply it to each value in a single column
  • The Pandas .apply() method can pass a function to either a single column or an entire DataFrame
  • .map() and .apply() are typically slower than built-in vectorized functions. Be careful with performance hogs!


Nik Piepenbreier

Nik is the author of datagy.io and has over a decade of experience working with data analytics, data science, and Python. He specializes in teaching developers how to use Python for data science using hands-on tutorials.
