In this tutorial, you’ll learn how to use Python and Pandas to iterate over the rows of a Pandas dataframe.
The tutorial will begin by exploring why iterating over Pandas dataframe rows is often unnecessary and is often much slower than alternatives like vectorization. That being said, there are times when you may need to iterate over a dataframe’s rows. Because of this, we’ll explore four different methods by which you can do this. You’ll learn how to use the Pandas .iterrows(), .itertuples(), and .items() methods. You’ll also learn how to use Python for loops to loop over each row in a Pandas dataframe.
The Quick Answer: Use Pandas .iterrows()
Why Iterating Over Pandas Dataframe Rows is a Bad Idea
Pandas itself warns against iterating over dataframe rows. The official documentation indicates that in most cases it isn’t needed, and any dataframe over 1,000 records will start to show significant slowdowns. Pandas recommends using vectorization where possible. If, however, you need to apply a specific formula, then using the .apply() method is an attractive alternative.
While iterating over rows may seem like a logical approach for those coming from tools like Excel, many processes can be handled much better in other ways. Iterating over rows, unless necessary, is a bad habit to fall into.
In order of preference, my recommended approach is to:
- Vectorize if possible,
- Use the .apply() method if you need to apply a function that requires row-level information
The alternatives listed above are much more idiomatic and easier to read. While using the .apply() method is slower than vectorization, it can often be easier for beginners to wrap their heads around.
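As a rough illustration of the second option, here is a minimal sketch of .apply() with a function that needs more than one value from each row. The label_row function and its 2,000 threshold are hypothetical, made up for this example, and the dataframe simply mirrors the Year and Sales columns used in this tutorial.

```python
import pandas as pd

# Sample data mirroring the tutorial's Year and Sales columns
df = pd.DataFrame({
    'Year': [2018, 2019, 2020, 2021],
    'Sales': [1000, 2300, 1900, 3400],
})

def label_row(row):
    # Hypothetical rule: flag years with sales above 2,000
    status = 'high' if row['Sales'] > 2000 else 'low'
    return f"{row['Year']}: {status}"

# axis=1 passes each row (as a Series) to the function
labels = df.apply(label_row, axis=1)
print(labels.tolist())
```

Passing axis=1 is what makes .apply() work row by row; without it, the function would receive each column instead.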
Loading a Sample Pandas Dataframe
If you want to follow along with a sample dataframe, feel free to copy the code below. We’ll load a small dataframe so that we can print it out in its entirety. You likely won’t encounter any major performance hiccups running this dataframe, but they’ll become more and more noticeable as your dataset grows.
Let’s start by loading the data and printing it out.
import pandas as pd
df = pd.DataFrame.from_dict(
    {
        'Year': [2018, 2019, 2020, 2021],
        'Sales': [1000, 2300, 1900, 3400],
    }
)
print(df)
# Returns:
#    Year  Sales
# 0  2018   1000
# 1  2019   2300
# 2  2020   1900
# 3  2021   3400
In the next section, you’ll learn how to vectorize your dataframe operations in order to save some memory and time!
How to Vectorize Instead of Iterating Over Rows
In this section, you’ll learn, albeit very briefly, how to vectorize a dataframe operation.
In the example below, you’ll learn how to square each number in a column. If you were to iterate over each row, you would perform the calculation as many times as there are records in the column. By vectorizing, however, you can apply a transformation directly to the whole column at once.
Let’s see what vectorization looks like by using some Python code:
df['Sales Squared'] = df['Sales'] ** 2
print(df)
# Returns:
#    Year  Sales  Sales Squared
# 0  2018   1000        1000000
# 1  2019   2300        5290000
# 2  2020   1900        3610000
# 3  2021   3400       11560000
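To make the contrast concrete, the sketch below computes the same squares twice: once with an explicit Python loop and once with a single vectorized expression. It rebuilds the sample dataframe so it can run on its own. The variable names looped and vectorized are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Year': [2018, 2019, 2020, 2021],
    'Sales': [1000, 2300, 1900, 3400],
})

# Row-by-row version: one Python-level calculation per record
looped = []
for sale in df['Sales']:
    looped.append(sale ** 2)

# Vectorized version: a single operation over the whole column
vectorized = df['Sales'] ** 2

# Both approaches produce the same values
print(looped == vectorized.tolist())
```

Both versions give identical results. The difference is that the vectorized form hands the work to Pandas in one step rather than repeating it in Python for every record.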
Now that you know how to apply vectorization to your data, let’s explore how to use the Pandas .iterrows() method to iterate over a Pandas dataframe’s rows.
How to Use Pandas iterrows to Iterate over Dataframe Rows
To actually iterate over a Pandas dataframe’s rows, we can use the Pandas .iterrows() method. The method returns a generator that yields one tuple per row. Each tuple contains the row’s index (from the dataframe) and the row’s values. One important thing to note here is that .iterrows() does not maintain data types across a row: because each row is returned as a single Pandas series, columns of different types are converted to one common type. If you want to maintain data types, check out the next section on .itertuples().
Let’s see how the .iterrows() method works:
# Use .iterrows() to iterate over Pandas rows
for idx, row in df.iterrows():
    print(idx, row['Year'], row['Sales'])
# Returns:
# 0 2018 1000
# 1 2019 2300
# 2 2020 1900
# 3 2021 3400
As you can see, the method above generates a tuple, which we can unpack. The first item contains the index of the row and the second is a Pandas series containing the row’s data.
The .iterrows() method is quite slow because it needs to generate a Pandas series for each row.
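The point about data types is easier to see with a dataframe that mixes types. The sketch below uses a small hypothetical dataframe (the Units and Price columns are invented for this example, not part of the tutorial’s data) to show an integer column coming back as a float once its row is packed into a series.

```python
import pandas as pd

# Hypothetical dataframe mixing an integer and a float column
mixed = pd.DataFrame({'Units': [1, 2], 'Price': [9.99, 4.50]})

# The column itself holds integers
print(mixed['Units'].dtype)

for idx, row in mixed.iterrows():
    # Each row is one Series, so the integer is upcast to float
    print(idx, row.dtype, row['Units'])

# Grab the first row on its own for a closer look
first_idx, first_row = next(mixed.iterrows())
print(type(first_row['Units']).__name__)
```

If the original type matters in your loop, this is the main reason to reach for .itertuples() instead.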
In the next section, you’ll learn how to use the .itertuples() method to loop over a Pandas dataframe’s rows.
How to Use Pandas itertuples to Iterate over Dataframe Rows
The .itertuples() method is an interesting one that, like the .iterrows() method, returns an iterator over each row in a Pandas dataframe.
Unlike the previous method, the .itertuples() method returns a named tuple for each row in the dataframe. A named tuple is much like a normal tuple, except that each item is given an attribute name.
Let’s take a look at what this looks like by printing out each named tuple returned by the .itertuples() method:
# Use .itertuples() to iterate over dataframe rows
for row in df.itertuples():
    print(row)
# Returns (for the original Year and Sales columns):
# Pandas(Index=0, Year=2018, Sales=1000)
# Pandas(Index=1, Year=2019, Sales=2300)
# Pandas(Index=2, Year=2020, Sales=1900)
# Pandas(Index=3, Year=2021, Sales=3400)
We can see that each item in the tuple is given an attribute name. We can access a tuple’s items by referring to their attribute names.
Let’s see how we can print out each row’s Year attribute in Python:
# Use .itertuples() to iterate over dataframe rows
for row in df.itertuples():
    print(row.Year)
# Returns:
# 2018
# 2019
# 2020
# 2021
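The .itertuples() method also accepts index and name parameters, which the examples above leave at their defaults. The sketch below shows what they do. The name 'SalesRow' is an arbitrary label chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Year': [2018, 2019, 2020, 2021],
    'Sales': [1000, 2300, 1900, 3400],
})

# index=False drops the Index field; name sets the named tuple's class name
for row in df.itertuples(index=False, name='SalesRow'):
    print(row.Year, row.Sales)

# name=None yields plain tuples instead of named tuples
plain_rows = list(df.itertuples(index=False, name=None))
print(plain_rows)
```

Plain tuples are handy when you only need the values by position and don’t care about attribute access.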
In the next section, you’ll learn how to use the .items() method to loop over a dataframe’s items in Pandas.
How to Use Pandas items to Iterate over Dataframe Rows
The Pandas .items() method lets you reach each item in a dataframe, but it iterates column by column rather than row by row. It returns a generator that yields a tuple for each column, containing the column’s name and the column’s data as a Pandas series.
This makes it a roundabout way to work through rows, since reaching an individual row’s value means looping over each column’s series as well.
Let’s take a look at what this looks like:
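# Use .items() to iterate over a dataframe's columns
for column_name, data in df.items():
    print(column_name, data)
# Returns (for the original Year and Sales columns):
# Year 0    2018
# 1    2019
# 2    2020
# 3    2021
# Name: Year, dtype: int64
# Sales 0    1000
# 1    2300
# 2    1900
# 3    3400
# Name: Sales, dtype: int64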
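If you do want individual values out of .items(), one possible approach is to nest a second loop over each column’s series, as in the sketch below. The cells list is just an illustrative container for the results.

```python
import pandas as pd

df = pd.DataFrame({
    'Year': [2018, 2019, 2020, 2021],
    'Sales': [1000, 2300, 1900, 3400],
})

cells = []
for column_name, data in df.items():
    # data is a Series; its own .items() yields (index, value) pairs
    for idx, value in data.items():
        cells.append((idx, column_name, value))

# Show the first couple of (row index, column name, value) entries
print(cells[:2])
```

Note that the values arrive grouped by column, not by row, which is why this method is rarely the natural choice for row-wise work.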
In the next section, you’ll learn how to use a Python for loop to loop over a Pandas dataframe’s rows.
How to Use a For Loop to Iterate over Pandas Dataframe Rows
In this final section, you’ll learn how to use a Python for loop to loop over a Pandas dataframe’s rows.
We can use the Pandas .iloc accessor to access each row by position while looping over a range the length of the dataframe.
Let’s see what this method looks like in Python:
for i in range(len(df)):
    print(df.iloc[i, :])
# Returns:
# Year     2018
# Sales    1000
# Name: 0, dtype: int64
# Year     2019
# Sales    2300
# Name: 1, dtype: int64
# Year     2020
# Sales    1900
# Name: 2, dtype: int64
# Year     2021
# Sales    3400
# Name: 3, dtype: int64
You could also access just a column, or a set of columns, by passing column positions in place of the : selector. To learn more about the .iloc accessor, check out my in-depth tutorial here.
Conclusion
In this tutorial, you learned all about iterating over rows in a Pandas dataframe. You began by learning why iterating over a dataframe row by row is a bad idea, and why vectorization is a much better alternative for most tasks. You also learned how to iterate over rows in a Pandas dataframe using three different dataframe methods as well as a for loop using the dataframe index.
To learn more about the Pandas .iterrows() method, check out the official documentation here.