In this tutorial, you’ll learn how to combine data in Pandas by merging, joining, and concatenating DataFrames. You’ll learn how to perform database-style merging of DataFrames based on common columns or indices using the merge()
function and the .join()
method. You’ll also learn how to combine datasets by concatenating multiple DataFrames with similar columns.
Table of Contents
Different Ways to Combine Data
There are a number of different ways in which you may want to combine data. For example, you can combine datasets by concatenating them. This process involves combining datasets together by including the rows of one dataset underneath the rows of the other. This process will be referred to as concatenating or appending datasets.
There are a number of ways in which you can concatenate datasets. For example, you can require that all datasets have the same columns. On the other hand, you can choose to include any mismatched columns as well, thereby introducing the potential for including missing data.
Generally, the process of concatenating datasets will make your dataset longer, rather than wider. However, if you are comfortable with appending datasets with mismatched columns, the resulting dataset may also grow wider. Concatenating datasets focuses on merging based on columns, rather than based on records. This, is a generalization and is not always true!
You can also join datasets together, to make a larger one. This is what will be referred to as either joining or merging datasets in this tutorial. When you join a dataset with another, you are merging these sets based on a key (or keys).
Generally speaking, the process of merging datasets will focus on making the dataset wider, rather than longer. Depending on the overall between records, however, and the method of merging you choose, you may also introduce more rows. Merging datasets focuses on merging based on the records’ values, rather than based on column headers.
This tutorial is split into two main sections: concatenating DataFrames and merging DataFrames. Let’s begin by learning how to concatenate, or append, multiple DataFrames into a single DataFrame.
Concatenating Pandas DataFrames
By concatenating DataFrames, you stitch together datasets along an axis – either rows or columns. Throughout this tutorial, we’ll explore how to stitch Pandas DataFrames together using column-wise combining. When we concatenate or append DataFrames, we need to make some decisions. These decisions can include determining whether to maintain the original indices, add additional helpful keys, and more.
Loading Sample Pandas DataFrames
To follow along with this tutorial, let’s load a few DataFrames to better understand the nuances of some how to combine Pandas DataFrames. We’ll load three different DataFrames with some overlap:
# Loading Sample Pandas DataFrames
import pandas as pd
df1 = pd.DataFrame.from_dict({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c'], 'col3': ['a1', 'b2', 'c3']})
df2 = pd.DataFrame.from_dict({'col1': [4, 5, 6], 'col2': ['d', 'e', 'f'], 'col3': ['d4', 'e5', 'f6']})
df3 = pd.DataFrame.from_dict({'col1': [7, 8, 9], 'col2': ['g', 'h', 'i'], 'col4': ['g7', 'h2', 'i3']})
print('df1 looks like:')
print(df1)
print('\ndf2 looks like:')
print(df2)
print('\ndf3 looks like:')
print(df3)
# Returns:
# df1 looks like:
# col1 col2 col3
# 0 1 a a1
# 1 2 b b2
# 2 3 c c3
# df2 looks like:
# col1 col2 col3
# 0 4 d d4
# 1 5 e e5
# 2 6 f f6
# df3 looks like:
# col1 col2 col4
# 0 7 g g7
# 1 8 h h2
# 2 9 i i3
We can see that we have three basic DataFrames, each with three rows. The first two DataFrames have columns that overlap in entirety, while the third has a column that doesn’t exist in the first two.
Understanding the Pandas concat Function
Let’s take a look at the Pandas concat()
function, which can be used to combine DataFrames together. The function looks like this:
# Understanding the Pandas concat() function
import pandas as pd
pd.concat(
objs, # The Series or DataFrame objects to concatenate
axis=0, # The axis to concatenate along
join='outer', # How to handle additional data elements
ignore_index=False, # Whether to maintain the original index or not
keys=None, # when multiple levels are passed
levels=None, # Levels to use to construct a multi-index
names=None, # Names for levels to use in multi-index
verify_integrity=False, # Check if the new axis contains duplicates
sort=False, # Sort the non-concatenation axis, if columns aren't aligned
copy=True # If False, do not copy data unnecessarily
)
The only parameter that doesn’t have a default argument is the objs
parameter. This contains a list of DataFrames or Series that you want to concatenate. The function returns either a Series or a DataFrame, depending on what is passed in.
The method gives a ton of different parameters to help customize how the data will be concatenated. In many cases, you’ll get by even if you don’t fully understand all of them. However, it’s good to know that they’re there and what they do in case you do need them for your use case.
Concatenating DataFrames with Pandas concat
Let’s start by looking at the most straightforward example: concatenating two DataFrames with the same columns. Since the function only requires you to pass in the objects you want to concatenate, you can simply pass in the list of objects. In this case, we can pass in [df1, df2]
. Let’s see what this results in:
# Concatenating simple DataFrames
df_concat = pd.concat([df1, df2])
print(df_concat)
# Returns:
# col1 col2 col3
# 0 1 a a1
# 1 2 b b2
# 2 3 c c3
# 0 4 d d4
# 1 5 e e5
# 2 6 f f6
Because we didn’t pass in any arguments, the DataFrames were concatenated as they were. Because of this, the original index values were included. Let’s see how you can use the ignore_index=
parameter to not preserve the original index from different DataFrames. This results in the indices being relabeled from 0 through to the end of the length.
# Ignoring an index when appending
df_concat = pd.concat([df1, df2], ignore_index=True)
print(df_concat)
# Returns:
# col1 col2 col3
# 0 1 a a1
# 1 2 b b2
# 2 3 c c3
# 3 4 d d4
# 4 5 e e5
# 5 6 f f6
This can be a much cleaner approach when your index is simply the row label.
Concatenating DataFrames with Mixed Columns
In the example above, all the columns between the two DataFrames matched exactly. This is not always the case. In this section, you’ll learn how concatenating two DataFrames with different columns will work. Let’s give this a shot using [df1, df3]
as our objects.
# Concatenating DataFrames with Different Columns
df_concat = pd.concat([df1, df3], ignore_index=True)
print(df_concat)
# Returns:
# col1 col2 col3 col4
# 0 1 a a1 NaN
# 1 2 b b2 NaN
# 2 3 c c3 NaN
# 3 7 g NaN g7
# 4 8 h NaN h2
# 5 9 i NaN i3
In the example above, the two DataFrames were concatenated. However, since there were columns that didn’t exist in both DataFrames, the columns were added with missing values.
The reason these values were included is because the default argument for the join=
parameter is 'outer'
. Because of this, all values across the objects are maintained. The other potential argument for this is 'inner'
, which will only concatenate the objects with overlapping matches. Let’s see what this would return:
# Changing to an inner concatenation
df_concat = pd.concat([df1, df3], ignore_index=True, join='inner')
print(df_concat)
# Returns:
# col1 col2
# 0 1 a
# 1 2 b
# 2 3 c
# 3 7 g
# 4 8 h
# 5 9 i
Because the data is joined using the column-axis, the columns that do not overlap between all the objects are dropped. In this case, only col1
and col2
are included in the resulting DataFrame.
Concatenating Multiple DataFrames at Once
The Pandas concat
function also allows you to concatenate more than two DataFrames at a single time. This can be incredibly helpful when you want to append multiple DataFrames. Let’s see how we can pass in all the DataFrames we currently have:
# Concatenating Multiple DataFrames
df_concat = pd.concat([df1, df2, df3], ignore_index=True)
print(df_concat)
# Returns:
# col1 col2 col3 col4
# 0 1 a a1 NaN
# 1 2 b b2 NaN
# 2 3 c c3 NaN
# 3 4 d d4 NaN
# 4 5 e e5 NaN
# 5 6 f f6 NaN
# 6 7 g NaN g7
# 7 8 h NaN h2
# 8 9 i NaN i3
This lets you easily combine multiple DataFrames.
Concatenating DataFrames with the append
Another method that you have available is the Pandas .append()
method. When applied to a DataFrame, you can pass in another DataFrame to append it. The method is a shortcut to the concat()
function, which gives you significant flexibility.
Because this method is a convenience method, the method can only append one DataFrame to another. In order to append multiple DataFrames, you do need to use the concat()
function.
Let’s see how we can append df1
and df2
using the .append()
method:
# Appending Two DataFrames with .append()
df_append = df1.append(df2, ignore_index=True)
print(df_append)
# Returns:
# col1 col2 col3
# 0 1 a a1
# 1 2 b b2
# 2 3 c c3
# 3 4 d d4
# 4 5 e e5
# 5 6 f f6
The method doesn’t take nearly as many parameters of the concat()
function. However, one parameter to be aware of is the sort=
parameter. This defaults to False
, but if toggled to True
, then Pandas will first sort the columns in the DataFrames to help them be in the same order.
Merging Data with Pandas merge
Pandas handles database-like joining operations with great flexibility. While, on the surface, the function works quite elegantly, there is a lot of flexibility under the hood. For example, you can complete many different merge types (such as inner, outer, left, and right) and merge on a single key or multiple keys.
Understanding Different Merge Types
Let’s take a look at understanding how DataFrames can be merged. Knowing how different merge types work allows you to better understand why merges happen in the way that they do. By default, Pandas will use an 'inner'
join to merge data.
There are four main merge types available to you:
- inner join: only merged records where both keys match
- outer join: records from both DataFrames are included, even if some keys don’t match
- left join: all records from the left DataFrame are included, and only matching records from the right DataFrame are included
- right join: all records from the right DataFrame are included, and only matching records from the left DataFrame are included
Let’s see how we can join DataFrames in Pandas!
Understanding the Pandas merge Function
The Pandas merge function provides a huge array of parameters. While you may not always need to use all the different parameters, knowing that they exist and how they work can make your life a lot easier. Let’s see what parameters are available:
# Understanding the Pandas merge() Function
pd.merge(
left, # A DataFrame or Series
right, # A DataFrame or Series
how="inner", # What join type to use
on=None, # What key to use (string or list of strings)
left_on=None, # If keys on the left are different from the right
right_on=None, # If keys on the right are different from the left
left_index=False, # If merging on the index, set to True
right_index=False, # If merging on the index, set to True
sort=True, # Whether to sort the joing keys in lexigraphical order
suffixes=("_x", "_y"), # A tuple of strings to append to overlapping columns
copy=True, # Whether to copy data. Should be set to True.
indicator=False, # Add a column indicating both, left_only, or right_only
validate=None, # Whether to validate a merge type
)
Based on the default arguments, you can see that the left
and right
parameters are the only required ones. Let’s take a look at a simple merge operation after loading some sample DataFrames.
Loading Sample DataFrames
In order to follow along with this tutorial, let’s load three simple DataFrames. Simply copy the code below to load these DataFrames to be able to better follow along with this section of the tutorial:
# Loading Sample DataFrames
books = pd.DataFrame.from_dict({
'Author ID': [1,1,2,3],
'Book ID': [1,2,1,1],
'Name': ['Intro to Python', 'Python 201', 'Data Science', 'Machine Learning']})
authors = pd.DataFrame.from_dict({
'Author ID': [1,2,3,4],
'Name': ['Nik', 'Kate', 'Jane', 'Evan']})
sales = pd.DataFrame.from_dict({
'Author ID': [1,1,1,2,3,4],
'Book ID': [1,2,1,1,1,1],
'Sales': [10, 20, 10, 30, 45, 10]})
print('DataFrame books looks like:')
print(books.head(2))
print('\nDataFrame authors looks like:')
print(authors.head(2))
print('\nDataFrame sales looks like:')
print(sales.head(2))
# Returns:
# Author ID Book ID Name
# 0 1 1 Intro to Python
# 1 1 2 Python 201
# DataFrame authors looks like:
# Author ID Name
# 0 1 Nik
# 1 2 Kate
# DataFrame sales looks like:
# Author ID Book ID Sales
# 0 1 1 10
# 1 1 2 20
Let’s now start looking at how we can easily merge data in Pandas.
Merging Pandas DataFrames on a Single Column
The simpest type of merge we can do is to merge on a single column. Let’s take a look at how we can merge the books
DataFrame and the authors
DataFrame. To gain the author’s name, we merge the DataFrames based on the author’s ID. Let’s see how we can do this with mostly the default arguments.
# Merging DataFrames based on a Single Column
merged = pd.merge(
left=books,
right=authors,
left_on='Author ID',
right_on='Author ID'
)
print(merged)
# Returns:
# Author ID Book ID Name_x Author ID Name_y
# 0 1 1 Intro to Python 1 Nik
# 1 1 2 Python 201 1 Nik
# 2 2 1 Data Science 2 Kate
# 3 3 1 Machine Learning 3 Jane
By default, Pandas will complete an inner join. This means that only records where the left_on
key and right_on
key exist in both datasets will be joined. Because of this, the author with an ID of 4 is not merged into the dataset.
Modifying the Pandas Merge Type
If we changed the how=
parameter to 'outer'
, then all records from both datasets would be included:
# Changing the merge type
merged = pd.merge(
left=books,
right=authors,
left_on='Author ID',
right_on='Author ID',
how='outer'
)
print(merged)
# Returns:
# Author ID Book ID Name_x Author ID Name_y
# 0 1.0 1.0 Intro to Python 1 Nik
# 1 1.0 2.0 Python 201 1 Nik
# 2 2.0 1.0 Data Science 2 Kate
# 3 3.0 1.0 Machine Learning 3 Jane
# 4 NaN NaN NaN 4 Evan
You can see that now all records from both DataFrames have been included. Any values that don’t exist in one dataset are shown as missing NaN
values.
Modifying Duplicate Name Suffixes in Pandas Merge
There are two columns with the same names. Because Pandas DataFrames can’t have columns with the same names, the merge()
function appends suffixes to these columns. By default, Pandas uses ('_x', '_y')
to differentiate the columns. You can change these by making use of the suffixes=
parameter to modify the suffixes. The parameter requires a sequence, such as a list or a tuple, of length 2.
# Changing the suffixes of duplicate columns
merged = pd.merge(
left=books,
right=authors,
left_on='Author ID',
right_on='Author ID',
suffixes=(' (book)', ' (author)')
)
print(merged)
# Returns:
# Author ID Book ID Name (book) Author ID Name (author)
# 0 1 1 Intro to Python 1 Nik
# 1 1 2 Python 201 1 Nik
# 2 2 1 Data Science 2 Kate
# 3 3 1 Machine Learning 3 Jane
Modifying the suffixes can be helpful in order to make your DataFrame easier to interpret.
Merging Pandas DataFrames on Multiple Columns
There may be times when you need to merge Pandas DataFrames on multiple columns. In SQL, this is referred to as using a composite key. In our books
and sales
DataFrames, there is no single column that uniquely identifies a record. Instead, the combination of Author ID
and Book ID
uniquely identify each record in these DataFrames.
Let’s see how we can merge these two DataFrames based on the compound key. In order to do this, we can simply pass in a list of the columns to use. In this case, because the columns are the same in both DataFrames, you can use the on=
parameter, rather than specifying which columns to use from which DataFrame.
# Merging in Pandas based on a Compound Key
merged = pd.merge(
left=books,
right=sales,
on=['Author ID', 'Book ID']
)
print(merged)
# Returns:
# Author ID Book ID Name Sales
# 0 1 1 Intro to Python 10
# 1 1 1 Intro to Python 10
# 2 1 2 Python 201 20
# 3 2 1 Data Science 30
# 4 3 1 Machine Learning 45
Differences Between Pandas merge and join
The Pandas merge()
function is a module function, meaning it is called as a function. Pandas also comes with a class instance method, .join()
. The method is called on one DataFrame and is used to join another. Let’s see what parameters you have available to you:
# Understanding the .join() Method
df.join(
other,
on=None,
how='left',
lsuffix='',
rsuffix='',
sort=False
)
The method allows you to pass in another DataFrame, define merge keys, and define the merge method. Let’s replicate our above merge using this method:
# Using .join to Join Two DataFrames
joined = books.join(authors.set_index(‘Author ID’), on=’Author ID’,
lsuffix=’_books’, rsuffix=’_authors’)
print(joined)
# Returns:
# Author ID Book ID Name_books ID Name_authors
# 0 1 1 Intro to Python 2 Kate
# 1 1 2 Python 201 2 Kate
# 2 2 1 Data Science 3 Jane
# 3 3 1 Machine Learning 4 Evan
Exercises
It’s time to check your learning! Try to solve the exercises below. If you need a hint or want to check your solution simply toggle the question to view a solution.
True or False: In this tutorial, concatenation was explained as lengthening a DataFrame. Merging was explained as widening a DataFrame. Concatenation can also be used to widen a DataFrame?
True – by using a different axis.
When merging, when would you use on= rather than left_on= and right_on=?
on=
can be used to simplify the process of identifying merge columns. If the columns are named the same in both DataFrames, on=
can be used.
Read the documentation on the merge function. Why would you want to use the validate= parameter?
The validate=
parameter allows you to verify the specific type of merge being applied. For example, you can ensure that the merge is of type '1:1
, meaning that for every record on the left, there is only a single corresponding record on the right.
Conclusion and Recap
In this tutorial, you learned how to concatenate, merge and join Pandas DataFrames. These operations allow you to work with different data sources, such as those coming from SQL databases. The section below provides a recap of what you’ve learned:
- Combining data can taken many different forms. You can merge data based on record keys or based on attribute keys.
- Concatenating DataFrames can be done using the
concat()
function or the.append()
method. The.append()
method is simply a convenience function allowing you to concatenate a single DataFrame to antoher. Meanwhile, theconcat()
function allows you to combine multiple DataFrames in one go. - The Pandas
merge()
function allows you to take on database-style merges. The.join()
method is an instance method that simplifies the merging process.
Additional Resources
To learn about related topics, check out the articles below:
very good , except your author key in the merge sampes is wrong
it should be ‘Author ID’ and not ‘ID’ !!!
Thanks so much, Tom! I have fixed those typos :).
Hi Nik, thanks for your great tutorials, they are very helpful! Only in this one I got very confused with your last .join() example, bcs. it gives a different result than the .merge() example, and matches the incorrect authors to the books. After a lot of searching I found out why: this is bcs. the df1.join(df2) method will always join on the indices of both DataFrames. You CAN specify a column for df1 as the join key, but not for df2, it will always use the df2 index as join key. This explains the incorrect results in your example above. A work-around you can do however is to reset the index for df2 with the df2 column that you want to use as join key. So the code in the above example that will give the correct result is:
joined2 = books.join(authors.set_index(‘Author ID’), on=’Author ID’,
lsuffix=’_books’, rsuffix=’_authors’)
print(joined2)
Thank you so much, Mike, for your thoughtful comment! I have made a note to fix this as soon as possible. I appreciate you letting me know :).