Skip to content

Combine Data in Pandas with merge, join, and concat

Combine Data in Pandas with merge, join, and concat Cover image

In this tutorial, you’ll learn how to combine data in Pandas by merging, joining, and concatenating DataFrames. You’ll learn how to perform database-style merging of DataFrames based on common columns or indices using the merge() function and the .join() method. You’ll also learn how to combine datasets by concatenating multiple DataFrames with similar columns.

Different Ways to Combine Data

There are a number of different ways in which you may want to combine data. For example, you can combine datasets by concatenating them. This process involves combining datasets together by including the rows of one dataset underneath the rows of the other. This process will be referred to as concatenating or appending datasets.

There are a number of ways in which you can concatenate datasets. For example, you can require that all datasets have the same columns. On the other hand, you can choose to include any mismatched columns as well, thereby introducing the potential for including missing data.

Generally, the process of concatenating datasets will make your dataset longer, rather than wider. However, if you are comfortable with appending datasets with mismatched columns, the resulting dataset may also grow wider. Concatenating datasets focuses on merging based on columns, rather than based on records. This, is a generalization and is not always true!

You can also join datasets together, to make a larger one. This is what will be referred to as either joining or merging datasets in this tutorial. When you join a dataset with another, you are merging these sets based on a key (or keys).

Generally speaking, the process of merging datasets will focus on making the dataset wider, rather than longer. Depending on the overall between records, however, and the method of merging you choose, you may also introduce more rows. Merging datasets focuses on merging based on the records’ values, rather than based on column headers.

This tutorial is split into two main sections: concatenating DataFrames and merging DataFrames. Let’s begin by learning how to concatenate, or append, multiple DataFrames into a single DataFrame.

Concatenating Pandas DataFrames

By concatenating DataFrames, you stitch together datasets along an axis – either rows or columns. Throughout this tutorial, we’ll explore how to stitch Pandas DataFrames together using column-wise combining. When we concatenate or append DataFrames, we need to make some decisions. These decisions can include determining whether to maintain the original indices, add additional helpful keys, and more.

Loading Sample Pandas DataFrames

To follow along with this tutorial, let’s load a few DataFrames to better understand the nuances of some how to combine Pandas DataFrames. We’ll load three different DataFrames with some overlap:

# Loading Sample Pandas DataFrames
import pandas as pd

df1 = pd.DataFrame.from_dict({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c'], 'col3': ['a1', 'b2', 'c3']})
df2 = pd.DataFrame.from_dict({'col1': [4, 5, 6], 'col2': ['d', 'e', 'f'], 'col3': ['d4', 'e5', 'f6']})
df3 = pd.DataFrame.from_dict({'col1': [7, 8, 9], 'col2': ['g', 'h', 'i'], 'col4': ['g7', 'h2', 'i3']})

print('df1 looks like:')
print(df1)
print('\ndf2 looks like:')
print(df2)
print('\ndf3 looks like:')
print(df3)

# Returns:
# df1 looks like:
#    col1 col2 col3
# 0     1    a   a1
# 1     2    b   b2
# 2     3    c   c3

# df2 looks like:
#    col1 col2 col3
# 0     4    d   d4
# 1     5    e   e5
# 2     6    f   f6

# df3 looks like:
#    col1 col2 col4
# 0     7    g   g7
# 1     8    h   h2
# 2     9    i   i3

We can see that we have three basic DataFrames, each with three rows. The first two DataFrames have columns that overlap in entirety, while the third has a column that doesn’t exist in the first two.

Understanding the Pandas concat Function

Let’s take a look at the Pandas concat() function, which can be used to combine DataFrames together. The function looks like this:

# Understanding the Pandas concat() function
import pandas as pd
pd.concat(
    objs,                   # The Series or DataFrame objects to concatenate
    axis=0,                 # The axis to concatenate along
    join='outer',           # How to handle additional data elements
    ignore_index=False,     # Whether to maintain the original index or not
    keys=None,              # when multiple levels are passed
    levels=None,            # Levels to use to construct a multi-index
    names=None,             # Names for levels to use in multi-index
    verify_integrity=False, # Check if the new axis contains duplicates
    sort=False,             # Sort the non-concatenation axis, if columns aren't aligned
    copy=True               # If False, do not copy data unnecessarily
)

The only parameter that doesn’t have a default argument is the objs parameter. This contains a list of DataFrames or Series that you want to concatenate. The function returns either a Series or a DataFrame, depending on what is passed in.

The method gives a ton of different parameters to help customize how the data will be concatenated. In many cases, you’ll get by even if you don’t fully understand all of them. However, it’s good to know that they’re there and what they do in case you do need them for your use case.

Concatenating DataFrames with Pandas concat

Let’s start by looking at the most straightforward example: concatenating two DataFrames with the same columns. Since the function only requires you to pass in the objects you want to concatenate, you can simply pass in the list of objects. In this case, we can pass in [df1, df2]. Let’s see what this results in:

# Concatenating simple DataFrames
df_concat = pd.concat([df1, df2])
print(df_concat)

# Returns:
#    col1 col2 col3
# 0     1    a   a1
# 1     2    b   b2
# 2     3    c   c3
# 0     4    d   d4
# 1     5    e   e5
# 2     6    f   f6

Because we didn’t pass in any arguments, the DataFrames were concatenated as they were. Because of this, the original index values were included. Let’s see how you can use the ignore_index= parameter to not preserve the original index from different DataFrames. This results in the indices being relabeled from 0 through to the end of the length.

# Ignoring an index when appending
df_concat = pd.concat([df1, df2], ignore_index=True)
print(df_concat)

# Returns:
#    col1 col2 col3
# 0     1    a   a1
# 1     2    b   b2
# 2     3    c   c3
# 3     4    d   d4
# 4     5    e   e5
# 5     6    f   f6

This can be a much cleaner approach when your index is simply the row label.

Concatenating DataFrames with Mixed Columns

In the example above, all the columns between the two DataFrames matched exactly. This is not always the case. In this section, you’ll learn how concatenating two DataFrames with different columns will work. Let’s give this a shot using [df1, df3] as our objects.

# Concatenating DataFrames with Different Columns
df_concat = pd.concat([df1, df3], ignore_index=True)
print(df_concat)

# Returns:
#    col1 col2 col3 col4
# 0     1    a   a1  NaN
# 1     2    b   b2  NaN
# 2     3    c   c3  NaN
# 3     7    g  NaN   g7
# 4     8    h  NaN   h2
# 5     9    i  NaN   i3

In the example above, the two DataFrames were concatenated. However, since there were columns that didn’t exist in both DataFrames, the columns were added with missing values.

The reason these values were included is because the default argument for the join= parameter is 'outer'. Because of this, all values across the objects are maintained. The other potential argument for this is 'inner', which will only concatenate the objects with overlapping matches. Let’s see what this would return:

# Changing to an inner concatenation
df_concat = pd.concat([df1, df3], ignore_index=True, join='inner')
print(df_concat)

# Returns:
#    col1 col2
# 0     1    a
# 1     2    b
# 2     3    c
# 3     7    g
# 4     8    h
# 5     9    i

Because the data is joined using the column-axis, the columns that do not overlap between all the objects are dropped. In this case, only col1 and col2 are included in the resulting DataFrame.

Concatenating Multiple DataFrames at Once

The Pandas concat function also allows you to concatenate more than two DataFrames at a single time. This can be incredibly helpful when you want to append multiple DataFrames. Let’s see how we can pass in all the DataFrames we currently have:

# Concatenating Multiple DataFrames
df_concat = pd.concat([df1, df2, df3], ignore_index=True)
print(df_concat)

# Returns:
#    col1 col2 col3 col4
# 0     1    a   a1  NaN
# 1     2    b   b2  NaN
# 2     3    c   c3  NaN
# 3     4    d   d4  NaN
# 4     5    e   e5  NaN
# 5     6    f   f6  NaN
# 6     7    g  NaN   g7
# 7     8    h  NaN   h2
# 8     9    i  NaN   i3

This lets you easily combine multiple DataFrames.

Concatenating DataFrames with the append

Another method that you have available is the Pandas .append() method. When applied to a DataFrame, you can pass in another DataFrame to append it. The method is a shortcut to the concat() function, which gives you significant flexibility.

Because this method is a convenience method, the method can only append one DataFrame to another. In order to append multiple DataFrames, you do need to use the concat() function.

Let’s see how we can append df1 and df2 using the .append() method:

# Appending Two DataFrames with .append()
df_append = df1.append(df2, ignore_index=True)
print(df_append)

# Returns:
#    col1 col2 col3
# 0     1    a   a1
# 1     2    b   b2
# 2     3    c   c3
# 3     4    d   d4
# 4     5    e   e5
# 5     6    f   f6

The method doesn’t take nearly as many parameters of the concat() function. However, one parameter to be aware of is the sort= parameter. This defaults to False, but if toggled to True, then Pandas will first sort the columns in the DataFrames to help them be in the same order.

Merging Data with Pandas merge

Pandas handles database-like joining operations with great flexibility. While, on the surface, the function works quite elegantly, there is a lot of flexibility under the hood. For example, you can complete many different merge types (such as inner, outer, left, and right) and merge on a single key or multiple keys.

Understanding Different Merge Types

Let’s take a look at understanding how DataFrames can be merged. Knowing how different merge types work allows you to better understand why merges happen in the way that they do. By default, Pandas will use an 'inner' join to merge data.

How different merge types work

There are four main merge types available to you:

  • inner join: only merged records where both keys match
  • outer join: records from both DataFrames are included, even if some keys don’t match
  • left join: all records from the left DataFrame are included, and only matching records from the right DataFrame are included
  • right join: all records from the right DataFrame are included, and only matching records from the left DataFrame are included

Let’s see how we can join DataFrames in Pandas!

Understanding the Pandas merge Function

The Pandas merge function provides a huge array of parameters. While you may not always need to use all the different parameters, knowing that they exist and how they work can make your life a lot easier. Let’s see what parameters are available:

# Understanding the Pandas merge() Function
pd.merge(
    left,                   # A DataFrame or Series
    right,                  # A DataFrame or Series
    how="inner",            # What join type to use
    on=None,                # What key to use (string or list of strings)
    left_on=None,           # If keys on the left are different from the right
    right_on=None,          # If keys on the right are different from the left
    left_index=False,       # If merging on the index, set to True
    right_index=False,      # If merging on the index, set to True
    sort=True,              # Whether to sort the joing keys in lexigraphical order
    suffixes=("_x", "_y"),  # A tuple of strings to append to overlapping columns
    copy=True,              # Whether to copy data. Should be set to True.
    indicator=False,        # Add a column indicating both, left_only, or right_only
    validate=None,          # Whether to validate a merge type
)

Based on the default arguments, you can see that the left and right parameters are the only required ones. Let’s take a look at a simple merge operation after loading some sample DataFrames.

Loading Sample DataFrames

In order to follow along with this tutorial, let’s load three simple DataFrames. Simply copy the code below to load these DataFrames to be able to better follow along with this section of the tutorial:

# Loading Sample DataFrames
books = pd.DataFrame.from_dict({
    'Author ID': [1,1,2,3],
    'Book ID': [1,2,1,1],
    'Name': ['Intro to Python', 'Python 201', 'Data Science', 'Machine Learning']})

authors = pd.DataFrame.from_dict({
    'Author ID': [1,2,3,4],
    'Name': ['Nik', 'Kate', 'Jane', 'Evan']})

sales = pd.DataFrame.from_dict({
    'Author ID': [1,1,1,2,3,4],
    'Book ID': [1,2,1,1,1,1],
    'Sales': [10, 20, 10, 30, 45, 10]})

print('DataFrame books looks like:')
print(books.head(2))
print('\nDataFrame authors looks like:')
print(authors.head(2))
print('\nDataFrame sales looks like:')
print(sales.head(2))

# Returns:
#    Author ID  Book ID             Name
# 0          1        1  Intro to Python
# 1          1        2       Python 201

# DataFrame authors looks like:
#    Author ID  Name
# 0          1   Nik
# 1          2  Kate

# DataFrame sales looks like:
#    Author ID  Book ID  Sales
# 0          1        1     10
# 1          1        2     20

Let’s now start looking at how we can easily merge data in Pandas.

Merging Pandas DataFrames on a Single Column

The simpest type of merge we can do is to merge on a single column. Let’s take a look at how we can merge the books DataFrame and the authors DataFrame. To gain the author’s name, we merge the DataFrames based on the author’s ID. Let’s see how we can do this with mostly the default arguments.

# Merging DataFrames based on a Single Column
merged = pd.merge(
    left=books,
    right=authors,
    left_on='Author ID',
    right_on='Author ID'
)

print(merged)

# Returns:
#    Author ID  Book ID            Name_x  Author ID Name_y
# 0          1        1   Intro to Python          1    Nik
# 1          1        2        Python 201          1    Nik
# 2          2        1      Data Science          2   Kate
# 3          3        1  Machine Learning          3   Jane

By default, Pandas will complete an inner join. This means that only records where the left_on key and right_on key exist in both datasets will be joined. Because of this, the author with an ID of 4 is not merged into the dataset.

Modifying the Pandas Merge Type

If we changed the how= parameter to 'outer', then all records from both datasets would be included:

# Changing the merge type
merged = pd.merge(
    left=books,
    right=authors,
    left_on='Author ID',
    right_on='Author ID',
    how='outer'
)

print(merged)

# Returns:
#    Author ID  Book ID            Name_x  Author ID Name_y
# 0        1.0      1.0   Intro to Python          1    Nik
# 1        1.0      2.0        Python 201          1    Nik
# 2        2.0      1.0      Data Science          2   Kate
# 3        3.0      1.0  Machine Learning          3   Jane
# 4        NaN      NaN               NaN          4   Evan

You can see that now all records from both DataFrames have been included. Any values that don’t exist in one dataset are shown as missing NaN values.

Modifying Duplicate Name Suffixes in Pandas Merge

There are two columns with the same names. Because Pandas DataFrames can’t have columns with the same names, the merge() function appends suffixes to these columns. By default, Pandas uses ('_x', '_y') to differentiate the columns. You can change these by making use of the suffixes= parameter to modify the suffixes. The parameter requires a sequence, such as a list or a tuple, of length 2.

# Changing the suffixes of duplicate columns
merged = pd.merge(
    left=books,
    right=authors,
    left_on='Author ID',
    right_on='Author ID',
    suffixes=(' (book)', ' (author)')
)

print(merged)

# Returns:
#    Author ID  Book ID       Name (book)  Author ID Name (author)
# 0          1        1   Intro to Python          1           Nik
# 1          1        2        Python 201          1           Nik
# 2          2        1      Data Science          2          Kate
# 3          3        1  Machine Learning          3          Jane

Modifying the suffixes can be helpful in order to make your DataFrame easier to interpret.

Merging Pandas DataFrames on Multiple Columns

There may be times when you need to merge Pandas DataFrames on multiple columns. In SQL, this is referred to as using a composite key. In our books and sales DataFrames, there is no single column that uniquely identifies a record. Instead, the combination of Author ID and Book ID uniquely identify each record in these DataFrames.

Let’s see how we can merge these two DataFrames based on the compound key. In order to do this, we can simply pass in a list of the columns to use. In this case, because the columns are the same in both DataFrames, you can use the on= parameter, rather than specifying which columns to use from which DataFrame.

# Merging in Pandas based on a Compound Key
merged = pd.merge(
    left=books,
    right=sales,
    on=['Author ID', 'Book ID']
)

print(merged)

# Returns:
#    Author ID  Book ID              Name  Sales
# 0          1        1   Intro to Python     10
# 1          1        1   Intro to Python     10
# 2          1        2        Python 201     20
# 3          2        1      Data Science     30
# 4          3        1  Machine Learning     45

Differences Between Pandas merge and join

The Pandas merge() function is a module function, meaning it is called as a function. Pandas also comes with a class instance method, .join(). The method is called on one DataFrame and is used to join another. Let’s see what parameters you have available to you:

# Understanding the .join() Method
df.join(
    other, 
    on=None, 
    how='left', 
    lsuffix='', 
    rsuffix='', 
    sort=False
)

The method allows you to pass in another DataFrame, define merge keys, and define the merge method. Let’s replicate our above merge using this method:

# Using .join to Join Two DataFrames
joined = books.join(authors.set_index(‘Author ID’), on=’Author ID’,
lsuffix=’_books’, rsuffix=’_authors’)
print(joined)

# Returns:
#    Author ID  Book ID        Name_books  ID Name_authors
# 0          1        1   Intro to Python   2         Kate
# 1          1        2        Python 201   2         Kate
# 2          2        1      Data Science   3         Jane
# 3          3        1  Machine Learning   4         Evan

Exercises

It’s time to check your learning! Try to solve the exercises below. If you need a hint or want to check your solution simply toggle the question to view a solution.

True or False: In this tutorial, concatenation was explained as lengthening a DataFrame. Merging was explained as widening a DataFrame. Concatenation can also be used to widen a DataFrame?

True – by using a different axis.

When merging, when would you use on= rather than left_on= and right_on=?

on= can be used to simplify the process of identifying merge columns. If the columns are named the same in both DataFrames, on= can be used.

Read the documentation on the merge function. Why would you want to use the validate= parameter?

The validate= parameter allows you to verify the specific type of merge being applied. For example, you can ensure that the merge is of type '1:1, meaning that for every record on the left, there is only a single corresponding record on the right.

Conclusion and Recap

In this tutorial, you learned how to concatenate, merge and join Pandas DataFrames. These operations allow you to work with different data sources, such as those coming from SQL databases. The section below provides a recap of what you’ve learned:

  • Combining data can taken many different forms. You can merge data based on record keys or based on attribute keys.
  • Concatenating DataFrames can be done using the concat() function or the .append() method. The .append() method is simply a convenience function allowing you to concatenate a single DataFrame to antoher. Meanwhile, the concat() function allows you to combine multiple DataFrames in one go.
  • The Pandas merge() function allows you to take on database-style merges. The .join() method is an instance method that simplifies the merging process.

Additional Resources

To learn about related topics, check out the articles below:

Nik Piepenbreier

Nik is the author of datagy.io and has over a decade of experience working with data analytics, data science, and Python. He specializes in teaching developers how to use Python for data science using hands-on tutorials.View Author posts

4 thoughts on “Combine Data in Pandas with merge, join, and concat”

  1. Hi Nik, thanks for your great tutorials, they are very helpful! Only in this one I got very confused with your last .join() example, bcs. it gives a different result than the .merge() example, and matches the incorrect authors to the books. After a lot of searching I found out why: this is bcs. the df1.join(df2) method will always join on the indices of both DataFrames. You CAN specify a column for df1 as the join key, but not for df2, it will always use the df2 index as join key. This explains the incorrect results in your example above. A work-around you can do however is to reset the index for df2 with the df2 column that you want to use as join key. So the code in the above example that will give the correct result is:
    joined2 = books.join(authors.set_index(‘Author ID’), on=’Author ID’,
    lsuffix=’_books’, rsuffix=’_authors’)
    print(joined2)

Leave a Reply

Your email address will not be published. Required fields are marked *