Python: Split a Pandas Dataframe

Split a Pandas Dataframe Cover Image

In this post, you’ll learn how to split a Pandas dataframe in different ways. You’ll learn how to split a Pandas dataframe by column value, how to split a Pandas dataframe by position, and how to split a Pandas dataframe by random values.

Knowing how to split a Pandas dataframe is a useful skill in many applications: machine learning to select random values, splitting data into specific records for report automation, etc.

Loading a Sample Dataframe

If you want to follow along with the tutorial, feel free to load the sample Pandas dataframe provided below. If you have your own data, feel free to use that as well, though some specific examples may need additional tailoring for your context.

Let’s get started and load some data!

import pandas as pd
df = pd.DataFrame.from_dict(
    {
        'Name': ['Jenny', 'Matt', 'Kristen', 'Jenny', 'Matt', 'Kristen', 'Jenny', 'Matt', 'Kristen', 'Jenny', 'Matt', 'Kristen'],
        'Year': [2020, 2021, 2022, 2020, 2021, 2022, 2020, 2021, 2022, 2020, 2021, 2022],
        'Income': [10000, 11000, 9000, 12000, 13000, 11000, 14000, 15000, 13000, 12000, 14000, 13000],
        'Gender': ['F', 'M', 'F', 'F', 'M', 'F', 'F', 'M', 'F', 'F', 'M', 'F']
    }
)
print(df)

Printing out the dataframe returns the following:

       Name  Year  Income Gender
0     Jenny  2020   10000      F
1      Matt  2021   11000      M
2   Kristen  2022    9000      F
3     Jenny  2020   12000      F
4      Matt  2021   13000      M
5   Kristen  2022   11000      F
6     Jenny  2020   14000      F
7      Matt  2021   15000      M
8   Kristen  2022   13000      F
9     Jenny  2020   12000      F
10     Matt  2021   14000      M

We can see that we have four columns: Name, Year, Income, and Gender. We have the incomes for three people for three years, and their genders. Let’s see how we can split this dataframe!

Split a Pandas Dataframe by Column Value

Splitting a dataframe by column value is a very helpful skill to know. It can help with automating reporting or being able to parse out different values of a dataframe.

The way that you’ll learn to split a dataframe by its column values is by using the .groupby() method. I have covered this method quite a bit in this video tutorial:

Let’ see how we can split the dataframe by the Name column:

grouped = df.groupby(df['Name'])
print(grouped.get_group('Jenny'))

What we have done here is:

  1. Created a group by object called grouped, splitting the dataframe by the Name column,
  2. Used the .get_group() method to get the dataframe’s rows that contain ‘Jenny’

Get All Groups of a Dataframe by Value

The method you learned above is helpful for pulling out a data if you know the group you want to get.

Now you’ll learn how to split the dataframe into all its possible groupings.

We’ll begin by getting all the different values in a column. We’ll do this by using the Pandas .unique() method, which you can learn about in detail in my post here. If you’d like to learn about it in video form, check out my video below:

In the example below, we loop over the dataframe and save the unique records to a new Excel file.

grouped = df.groupby(df['Name'])

path_to_directory = '/Users/datagy/'
for name in df['Name'].unique():
    temporary_df = grouped.get_group(name)
    temporary_df.to_excel(f'{path_to_directory}name.xlsx')

What we’ve done here is looped over the dataframe’s unique values in the Name column, received the group of each name, and saved it to an Excel file.

Split a Pandas Dataframe by Position

In this section, you’ll learn how to split a Pandas dataframe by a position in the dataframe. For example, how to split a dataframe in half or into thirds.

We can accomplish this very easily using the pandas .iloc accessor. This powerful tool makes it very easy to access data within Pandas. To learn more about it, check out my in-depth tutorial here.

The important thing here is to access the rows you want to access. We can do this using the built-in len() function to determine how long a dataframe is. You can learn six different method to figuring out how long a dataframe is using my tutorial here.

Let’s say we wanted to split a Pandas dataframe in half. We would split row-wise at the mid-point.

The way that we can find the midpoint of a dataframe is by finding the dataframe’s length and dividing it by two. Once we know the length, we can split the dataframe using the .iloc accessor.

>>> half_df = len(df) // 2
>>> first_half = df.iloc[:half_df,]

>>> print(first_half)

      Name  Year  Income Gender
0    Jenny  2020   10000      F
1     Matt  2021   11000      M
2  Kristen  2022    9000      F
3    Jenny  2020   12000      F
4     Matt  2021   13000      M
5  Kristen  2022   11000      F

We can see here that the dataframe has returned the first half of the values.

Let’s see how we can turn this into a function split the dataframe into multiple sections:

def split_dataframe_by_position(df, splits):
    """
    Takes a dataframe and an integer of the number of splits to create.
    Returns a list of dataframes.
    """
    dataframes = []
    index_to_split = len(df) // splits
    start = 0
    end = index_to_split
    for split in range(splits):
        temporary_df = df.iloc[start:end, :]
        dataframes.append(temporary_df)
        start += index_to_split
        end += index_to_split

    return dataframes

split_dataframes = split_dataframe_by_position(df, 3)
print(split_dataframes[1])

What this returns is the 2nd third of the original dataframe:

      Name  Year  Income Gender
4     Matt  2021   13000      M
5  Kristen  2022   11000      F
6    Jenny  2020   14000      F
7     Matt  2021   15000      M

Let’s explore what the function actually does:

  1. We instantiate a list called dataframes, which will hold the resulting dataframes
  2. We determine how many rows each dataframe will hold and assign that value to index_to_split
  3. We then assign start the value of 0 and end the first value from index_to_split
  4. Finally, we loop over the range of dataframes to split into, selecting data from 0 to that first index
  5. The resulting dataframe is added to the list dataframes and the values of start and end are increased.
  6. Finally, the list of dataframes is returned and can be accessed

Since lists are 0-indexed, you can access the first dataframe by accessing the lists 0th value.

As our final example, let’s see how we can split a dataframe into random rows.

Split a Pandas Dataframe into Random Values

We can also select a random selection of rows from a dataframe. Pandas comes with a very helpful .sample() method that allows you to select either a number of records to select or a fraction of rows to select. This can be very helpful if you don’t care what rows you’re returning, but want to split the dataframe.

Let’s see how we can split a dataframe in half using Pandas .sample():

>>> half_df = df.sample(frac=0.5)
>>> print(half_df)

      Name  Year  Income Gender
9    Jenny  2020   12000      F
0    Jenny  2020   10000      F
7     Matt  2021   15000      M
1     Matt  2021   11000      M
6    Jenny  2020   14000      F
2  Kristen  2022    9000      F

We can see here that the dataframe has returned a random selection of rows.

Now, let’s see how we can return just a number of rows using the Pandas .sample() method:

>>> df_3 = df.sample(n=3)
>>> print(df_3)

       Name  Year  Income Gender
9     Jenny  2020   12000      F
11  Kristen  2022   13000      F
0     Jenny  2020   10000      F

Here we have used the n= parameter to determine how many rows to return.

Conclusion

In this post, you learned how to split a Pandas dataframe. In particular, you learned how to split a Pandas dataframe by column value, by position, and also by random values.

To learn more about the Pandas .sample() method, check out the official documentation here.