In this post, you’ll learn how to split a Pandas dataframe in different ways. You’ll learn how to split a Pandas dataframe by column value, how to split a Pandas dataframe by position, and how to split a Pandas dataframe by random values.
Knowing how to split a Pandas dataframe is a useful skill in many applications: machine learning to select random values, splitting data into specific records for report automation, etc.
Table of Contents
Loading a Sample Dataframe
If you want to follow along with the tutorial, feel free to load the sample Pandas dataframe provided below. If you have your own data, feel free to use that as well, though some specific examples may need additional tailoring for your context.
Let’s get started and load some data!
import pandas as pd
df = pd.DataFrame.from_dict(
{
'Name': ['Jenny', 'Matt', 'Kristen', 'Jenny', 'Matt', 'Kristen', 'Jenny', 'Matt', 'Kristen', 'Jenny', 'Matt', 'Kristen'],
'Year': [2020, 2021, 2022, 2020, 2021, 2022, 2020, 2021, 2022, 2020, 2021, 2022],
'Income': [10000, 11000, 9000, 12000, 13000, 11000, 14000, 15000, 13000, 12000, 14000, 13000],
'Gender': ['F', 'M', 'F', 'F', 'M', 'F', 'F', 'M', 'F', 'F', 'M', 'F']
}
)
print(df)
Printing out the dataframe returns the following:
Name Year Income Gender
0 Jenny 2020 10000 F
1 Matt 2021 11000 M
2 Kristen 2022 9000 F
3 Jenny 2020 12000 F
4 Matt 2021 13000 M
5 Kristen 2022 11000 F
6 Jenny 2020 14000 F
7 Matt 2021 15000 M
8 Kristen 2022 13000 F
9 Jenny 2020 12000 F
10 Matt 2021 14000 M
We can see that we have four columns: Name, Year, Income, and Gender. We have the incomes for three people for three years, and their genders. Let’s see how we can split this dataframe!
Split a Pandas Dataframe by Column Value
Splitting a dataframe by column value is a very helpful skill to know. It can help with automating reporting or being able to parse out different values of a dataframe.
The way that you’ll learn to split a dataframe by its column values is by using the .groupby()
method. I have covered this method quite a bit in this video tutorial:
Let’ see how we can split the dataframe by the Name column:
grouped = df.groupby(df['Name'])
print(grouped.get_group('Jenny'))
What we have done here is:
- Created a
group by
object calledgrouped
, splitting the dataframe by the Name column, - Used the
.get_group()
method to get the dataframe’s rows that contain ‘Jenny’
Get All Groups of a Dataframe by Value
The method you learned above is helpful for pulling out a data if you know the group you want to get.
Now you’ll learn how to split the dataframe into all its possible groupings.
We’ll begin by getting all the different values in a column. We’ll do this by using the Pandas .unique()
method, which you can learn about in detail in my post here. If you’d like to learn about it in video form, check out my video below:
In the example below, we loop over the dataframe and save the unique records to a new Excel file.
grouped = df.groupby(df['Name'])
path_to_directory = '/Users/datagy/'
for name in df['Name'].unique():
temporary_df = grouped.get_group(name)
temporary_df.to_excel(f'{path_to_directory}name.xlsx')
What we’ve done here is looped over the dataframe’s unique values in the Name column, received the group of each name, and saved it to an Excel file.
Split a Pandas Dataframe by Position
In this section, you’ll learn how to split a Pandas dataframe by a position in the dataframe. For example, how to split a dataframe in half or into thirds.
We can accomplish this very easily using the pandas .iloc
accessor. This powerful tool makes it very easy to access data within Pandas. To learn more about it, check out my in-depth tutorial here.
The important thing here is to access the rows you want to access. We can do this using the built-in len()
function to determine how long a dataframe is. You can learn six different method to figuring out how long a dataframe is using my tutorial here.
Let’s say we wanted to split a Pandas dataframe in half. We would split row-wise at the mid-point.
The way that we can find the midpoint of a dataframe is by finding the dataframe’s length and dividing it by two. Once we know the length, we can split the dataframe using the .iloc
accessor.
>>> half_df = len(df) // 2
>>> first_half = df.iloc[:half_df,]
>>> print(first_half)
Name Year Income Gender
0 Jenny 2020 10000 F
1 Matt 2021 11000 M
2 Kristen 2022 9000 F
3 Jenny 2020 12000 F
4 Matt 2021 13000 M
5 Kristen 2022 11000 F
We can see here that the dataframe has returned the first half of the values.
Let’s see how we can turn this into a function split the dataframe into multiple sections:
def split_dataframe_by_position(df, splits):
"""
Takes a dataframe and an integer of the number of splits to create.
Returns a list of dataframes.
"""
dataframes = []
index_to_split = len(df) // splits
start = 0
end = index_to_split
for split in range(splits):
temporary_df = df.iloc[start:end, :]
dataframes.append(temporary_df)
start += index_to_split
end += index_to_split
return dataframes
split_dataframes = split_dataframe_by_position(df, 3)
print(split_dataframes[1])
What this returns is the 2nd third of the original dataframe:
Name Year Income Gender
4 Matt 2021 13000 M
5 Kristen 2022 11000 F
6 Jenny 2020 14000 F
7 Matt 2021 15000 M
Let’s explore what the function actually does:
- We instantiate a list called
dataframes
, which will hold the resulting dataframes - We determine how many rows each dataframe will hold and assign that value to
index_to_split
- We then assign
start
the value of 0 andend
the first value fromindex_to_split
- Finally, we loop over the range of dataframes to split into, selecting data from 0 to that first index
- The resulting dataframe is added to the list
dataframes
and the values ofstart
andend
are increased. - Finally, the list of dataframes is returned and can be accessed
Since lists are 0-indexed, you can access the first dataframe by accessing the lists 0th value.
As our final example, let’s see how we can split a dataframe into random rows.
Split a Pandas Dataframe into Random Values
We can also select a random selection of rows from a dataframe. Pandas comes with a very helpful .sample()
method that allows you to select either a number of records to select or a fraction of rows to select. This can be very helpful if you don’t care what rows you’re returning, but want to split the dataframe.
Let’s see how we can split a dataframe in half using Pandas .sample()
:
>>> half_df = df.sample(frac=0.5)
>>> print(half_df)
Name Year Income Gender
9 Jenny 2020 12000 F
0 Jenny 2020 10000 F
7 Matt 2021 15000 M
1 Matt 2021 11000 M
6 Jenny 2020 14000 F
2 Kristen 2022 9000 F
We can see here that the dataframe has returned a random selection of rows.
Now, let’s see how we can return just a number of rows using the Pandas .sample()
method:
>>> df_3 = df.sample(n=3)
>>> print(df_3)
Name Year Income Gender
9 Jenny 2020 12000 F
11 Kristen 2022 13000 F
0 Jenny 2020 10000 F
Here we have used the n=
parameter to determine how many rows to return.
Conclusion
In this post, you learned how to split a Pandas dataframe. In particular, you learned how to split a Pandas dataframe by column value, by position, and also by random values.
To learn more about the Pandas .sample()
method, check out the official documentation here.
Can we split a datafram row, and then insert a “\t” into the variable?
Can you give me a bit more detail of what you mean?
Thanks for the great job 👏🏿 You are doing
Thank you so much!