In this post, you’ll learn a number of different ways to sample data in Pandas. Getting a sample of data can be incredibly useful when you’re working with large datasets, as it helps your analysis run more smoothly. If you sample your data representatively, you can work with a much smaller dataset, making your analysis run much faster while still getting appropriate results.
In this post, we’ll explore a number of different ways in which you can get samples from your Pandas Dataframe. You’ll learn how to use Pandas to sample your dataframe, creating reproducible samples, weighted samples, and samples with replacements. You’ll also learn how to sample at a constant rate and sample items by conditions. Finally, you’ll learn how to sample only random columns.
The Quick Answer: Use Pandas .sample()
Loading our Sample Dataframe
For this tutorial, we’ll load a dataset that’s preloaded with Seaborn. If you want to learn more about loading datasets with Seaborn, check out my tutorial here. If you just want to follow along here, run the code below:
import pandas as pd
from seaborn import load_dataset
df = load_dataset('penguins')
print(df.head())
This returns the following dataframe:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
In the code above, we first import Pandas as pd and then import the load_dataset() function from the Seaborn library. Then we load the penguins dataset into our dataframe.
Using Pandas Sample to Sample your Dataframe
Pandas provides a very helpful method for, well, sampling data. The method is called using .sample() and provides a number of helpful parameters that we can apply. Before diving into some examples, let’s take a look at the method in a bit more detail:
DataFrame.sample(
n=None,
frac=None,
replace=False,
weights=None,
random_state=None,
axis=None,
ignore_index=False
)
The parameters give us the following options:
- n – the number of items to sample
- frac – the proportion (out of 1) of items to return
- replace – whether to sample with replacement (i.e., items can be sampled more than once)
- weights – by default, samples are equally weighted. A series indicating weights can be applied. If they do not add to 1, they will be normalized to 1.
- random_state – a seed number to produce reproducible results
- axis – the axis to sample
- ignore_index – whether to relabel the index or not
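As a quick illustration of how n= and frac= relate, here is a minimal sketch. It uses a small made-up dataframe named toy (an assumption for this example, not the penguins data) so that it runs on its own:

```python
import pandas as pd

# Hypothetical stand-in dataframe with 100 rows (not the penguins dataset)
toy = pd.DataFrame({"id": range(100), "value": [i * 2 for i in range(100)]})

by_count = toy.sample(n=10)          # ask for an exact number of rows
by_fraction = toy.sample(frac=0.1)   # ask for 10% of the rows
default = toy.sample()               # neither given: a single row is returned

print(len(by_count), len(by_fraction), len(default))

# n and frac cannot be combined; Pandas raises a ValueError
try:
    toy.sample(n=10, frac=0.1)
except ValueError as err:
    print("ValueError:", err)
```

In practice, n= is handy when you need a fixed sample size, while frac= keeps the sample proportional as the dataframe grows.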
Let’s take a look at an example. We’ll pull 5% of our records, by passing in frac=0.05 as an argument:
sample = df.sample(frac=0.05)
print(sample)
# Returns:
# species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
# 23 Adelie Biscoe 38.2 18.1 185.0 3950.0 Male
# 91 Adelie Dream 41.1 18.1 205.0 4300.0 Male
# 190 Chinstrap Dream 46.9 16.6 192.0 2700.0 Female
# 321 Gentoo Biscoe 55.9 17.0 228.0 5600.0 Male
# 198 Chinstrap Dream 50.1 17.9 190.0 3400.0 Female
# 170 Chinstrap Dream 46.4 18.6 190.0 3450.0 Female
# 232 Gentoo Biscoe 45.5 13.7 214.0 4650.0 Female
# 136 Adelie Dream 35.6 17.5 191.0 3175.0 Female
# 179 Chinstrap Dream 49.5 19.0 200.0 3800.0 Male
# 11 Adelie Torgersen 37.8 17.3 180.0 3700.0 NaN
# 86 Adelie Dream 36.3 19.5 190.0 3800.0 Male
# 249 Gentoo Biscoe 50.0 15.3 220.0 5550.0 Male
# 205 Chinstrap Dream 50.7 19.7 203.0 4050.0 Male
# 92 Adelie Dream 34.0 17.1 185.0 3400.0 Female
# 286 Gentoo Biscoe 46.2 14.4 214.0 4650.0 NaN
# 108 Adelie Biscoe 38.1 17.0 181.0 3175.0 Female
# 299 Gentoo Biscoe 45.2 16.4 223.0 5950.0 Male
We can see here that 5% of the dataframe is sampled. The first column represents the index of the original dataframe, and we can see that the index values are sampled randomly.
Tip: If you don’t want to keep the original index, simply pass in the ignore_index=True argument, which relabels the sampled rows 0, 1, 2, and so on.
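To make the effect of ignore_index=True concrete, here is a small sketch on a made-up dataframe (toy is a hypothetical name, not part of the tutorial's data). It assumes a Pandas version that supports the ignore_index parameter of .sample() (1.3 or later, as far as I know):

```python
import pandas as pd

# Hypothetical ten-row dataframe used only for illustration
toy = pd.DataFrame({"letter": list("abcdefghij")})

# Same seed in both calls, so the same rows are drawn
kept = toy.sample(n=4, random_state=0)
relabeled = toy.sample(n=4, random_state=0, ignore_index=True)

print(kept.index.tolist())       # original row labels, in sampled order
print(relabeled.index.tolist())  # relabeled as 0, 1, 2, 3
```

On older versions, calling .reset_index(drop=True) on the sample should give the same result.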
In the next section, you’ll learn how to use Pandas to create a reproducible sample of your data.
Creating a Reproducible Random Sample in Pandas
In your data science journey, you’ll run into many situations where you need to be able to reproduce the results of your analysis. Because of this, when you sample data using Pandas, it can be very helpful to know how to create reproducible results.
In many data science libraries, you’ll find either a seed or random_state argument. In the case of the .sample() method, the argument that allows you to create reproducible results is the random_state= argument.
In order to make this work, let’s pass in an integer to make our result reproducible. Let’s give this a shot using Python:
# Create a reproducible sample using random_state
sample = df.sample(n=5, random_state=1)
print(sample)
# Returns:
# species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
# 306 Gentoo Biscoe 43.4 14.4 218.0 4600.0 Female
# 341 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male
# 291 Gentoo Biscoe 46.4 15.6 221.0 5000.0 Male
# 102 Adelie Biscoe 37.7 16.0 183.0 3075.0 Female
# 289 Gentoo Biscoe 50.7 15.0 223.0 5550.0 Male
sample2 = df.sample(n=5, random_state=1)
print(sample2)
# Returns:
# species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
# 306 Gentoo Biscoe 43.4 14.4 218.0 4600.0 Female
# 341 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male
# 291 Gentoo Biscoe 46.4 15.6 221.0 5000.0 Male
# 102 Adelie Biscoe 37.7 16.0 183.0 3075.0 Female
# 289 Gentoo Biscoe 50.7 15.0 223.0 5550.0 Male
We can see here that by passing the same value into the random_state= argument, the same result is returned.
This allows us to produce a sample one day and recreate the same sample another day, making our results and analysis much more reproducible.
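Rather than comparing printed output by eye, you can check reproducibility programmatically. The sketch below does this on a made-up dataframe (toy is a hypothetical name) using the DataFrame .equals() method:

```python
import pandas as pd

# Hypothetical dataframe used only for illustration
toy = pd.DataFrame({"id": range(100)})

first = toy.sample(n=5, random_state=1)
second = toy.sample(n=5, random_state=1)

# .equals() compares both the values and the index labels
print(first.equals(second))
```

Without random_state=, repeated calls will generally return different rows.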
In the next section, you’ll learn how to apply weights to the samples of your Pandas Dataframe.
Pandas Weighted Samples
One of the very powerful features of the Pandas .sample() method is the ability to apply different weights to certain rows, meaning that some rows will have a higher chance of being selected than others.
To get started with this example, let’s take a look at the types of penguins we have in our dataset:
print(df['species'].unique())
# Returns: ['Adelie' 'Chinstrap' 'Gentoo']
Say we wanted to give the Chinstrap species a higher chance of being selected. We could apply weights to these species in another column, using the Pandas .map() method. To learn more about the .map() method, check out my in-depth tutorial on mapping values to another column here.
df['weights'] = df['species'].map({'Adelie': 20, 'Chinstrap': 60, 'Gentoo': 20})
sample = df.sample(n=5, weights='weights')
print(sample)
# Returns:
# species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex weights
# 163 Chinstrap Dream 51.7 20.3 194.0 3775.0 Male 60
# 223 Gentoo Biscoe 50.0 15.2 218.0 5700.0 Male 20
# 118 Adelie Torgersen 35.7 17.0 189.0 3350.0 Female 20
# 217 Chinstrap Dream 49.6 18.2 193.0 3775.0 Male 60
# 160 Chinstrap Dream 46.0 18.9 195.0 4150.0 Female 60
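We can see here that the Chinstrap species makes up three of the five sampled rows. With these weights, each Chinstrap row is three times as likely to be drawn as a row from either of the other species.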
Let’s break down what we’ve done here:
- We mapped a dictionary of weights onto the species column, using the Pandas .map() method
- We then passed our new column into the weights argument as weights='weights', which instructed Pandas to use the column to assign weights
Some important things to understand about the weights= argument:
- The weights are relative, row-level probabilities, so they don’t have to add up to 1
- If the values do not add up to 1, then Pandas will normalize them so that they do.
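The normalization can be sketched by hand. The example below uses a made-up ten-row dataframe (toy and its species counts are assumptions for illustration) and divides each weight by the total, which mirrors what Pandas is documented to do when weights do not sum to 1. It also shows that rows with a weight of 0 are never drawn:

```python
import pandas as pd

# Hypothetical dataframe: 4 Adelie, 4 Chinstrap, 2 Gentoo rows
toy = pd.DataFrame({"species": ["Adelie"] * 4 + ["Chinstrap"] * 4 + ["Gentoo"] * 2})

# Row-level weights; Gentoo rows get a weight of 0 for this illustration
raw_weights = toy["species"].map({"Adelie": 20, "Chinstrap": 60, "Gentoo": 0})

# Manual normalization: each row's weight divided by the sum of all weights
normalized = raw_weights / raw_weights.sum()
share_by_species = normalized.groupby(toy["species"]).sum()
print(share_by_species)

# Zero-weight rows cannot be selected
sample = toy.sample(n=5, weights=raw_weights, random_state=42)
print(sample["species"].tolist())
```

Because the weights apply per row, a species' overall share depends on both its weight and how many rows it has.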
In the next section, you’ll learn how to sample a dataframe with replacements, meaning that items can be chosen more than a single time.
Pandas Sample with Replacements
Another helpful feature of the Pandas .sample() method is the ability to sample with replacement, meaning that an item can be sampled more than a single time.
For this, we can use the boolean argument, replace=. By default, this is set to False, meaning that items cannot be sampled more than a single time. By setting it to True, however, the items are placed back into the sampling pile, allowing us to draw them again.
In order to demonstrate this, let’s work with a much smaller dataframe. We’ll sample our dataframe down to only five rows, so that we can see how often each row is sampled:
small_df = df.sample(n=5)
sample = small_df.sample(n=5, replace=True)
print(sample)
# Returns:
# species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
# 162 Chinstrap Dream 46.6 17.8 193.0 3800.0 Female
# 341 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male
# 162 Chinstrap Dream 46.6 17.8 193.0 3800.0 Female
# 45 Adelie Dream 39.6 18.8 190.0 4600.0 Male
# 59 Adelie Biscoe 37.6 19.1 194.0 3750.0 Male
Let’s break down what we’ve done here:
- We first returned small_df, which contained only five rows from our original dataframe
- We then re-sampled our dataframe to return five records. Normally, this would return each of the five records exactly once. However, since we passed in replace=True, Pandas was able to select each record more than once.
- Because of this, the record 162 was returned twice.
One interesting thing to note about this is that it can actually return a sample that is larger than the original dataset. For example, if we were to set the frac= argument to 1.2, we would need to set replace=True, since we’d be asking for 120% of the original records.
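Here is a minimal sketch of that oversampling case, again on a made-up dataframe (toy is a hypothetical name). It also shows what happens if replace=True is left out:

```python
import pandas as pd

# Hypothetical ten-row dataframe used only for illustration
toy = pd.DataFrame({"id": range(10)})

# 120% of 10 rows is 12 rows, which requires sampling with replacement
oversampled = toy.sample(frac=1.2, replace=True, random_state=0)
print(len(oversampled))

# Drawing 12 rows from 10 means at least one row must repeat
print(oversampled["id"].duplicated().any())

# Without replace=True, Pandas raises a ValueError for frac > 1
try:
    toy.sample(frac=1.2)
except ValueError as err:
    print("ValueError:", err)
```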
In the next section, you’ll learn how to sample at a constant rate.
Pandas Sampling Every nth Item (Sampling at a constant rate)
A popular sampling technique is to sample every nth item, meaning that you’re sampling at a constant rate.
In order to do this, we can use the incredibly useful Pandas .iloc accessor, which allows us to access items using slice notation. To learn more about using .iloc to select data, check out my tutorial here.
In Python, we can slice data in different ways using slice notation, which follows this pattern:
[start : end : step]
If we wanted to, say, select every 5th record, we could leave the start and end parameters empty (meaning they’d slice from beginning to end) and step over every 5 records.
Let’s see what this would look like:
sample = df.iloc[::5]
print(sample.head())
# Returns:
# species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
# 0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
# 5 Adelie Torgersen 39.3 20.6 190.0 3650.0 Male
# 10 Adelie Torgersen 37.8 17.1 186.0 3300.0 NaN
# 15 Adelie Torgersen 36.6 17.8 185.0 3700.0 Female
# 20 Adelie Biscoe 37.8 18.3 174.0 3400.0 Female
Taking a look at the index of our sample dataframe, we can see that it returns every fifth row. We can set the step to whatever rate we want.
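The start position of the slice can be changed as well, which shifts which rows are picked up. A small sketch on a made-up dataframe (toy and step are hypothetical names for this example):

```python
import pandas as pd

# Hypothetical twenty-row dataframe used only for illustration
toy = pd.DataFrame({"id": range(20)})

step = 5
every_fifth = toy.iloc[::step]     # start at position 0
offset_fifth = toy.iloc[2::step]   # start at position 2 instead

print(every_fifth["id"].tolist())   # [0, 5, 10, 15]
print(offset_fifth["id"].tolist())  # [2, 7, 12, 17]
```

Keep in mind that this kind of sampling is deterministic, so it is only representative if the row order itself carries no pattern.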
In the next section, you’ll learn how to use Pandas to sample items by a given condition.
Pandas Sampling Items by Conditions
You may also want to sample a Pandas Dataframe using a condition, meaning that you can return all rows that meet (or don’t meet) a certain condition. In order to filter our dataframe using conditions, we use the [] square bracket indexing method, where we pass a condition into the square brackets.
If you want to learn more about how to select items based on conditions, check out my tutorial on selecting data in Pandas.
Say we wanted to filter our dataframe to select only rows where bill_length_mm is less than 35.
We can write the following:
condition = df['bill_length_mm'] < 35
sample = df[condition]
print(sample.head())
# Returns:
# species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
# 8 Adelie Torgersen 34.1 18.1 193.0 3475.0 NaN
# 14 Adelie Torgersen 34.6 21.1 198.0 4400.0 Male
# 18 Adelie Torgersen 34.4 18.4 184.0 3325.0 Female
# 54 Adelie Biscoe 34.5 18.1 187.0 2900.0 Female
# 70 Adelie Torgersen 33.5 19.0 190.0 3600.0 Female
We can see here that we returned only rows where the bill length was less than 35.
Rather than splitting the condition off onto a separate line, we could also simply combine it to be written as sample = df[df['bill_length_mm'] < 35] to make our code more concise.
Pandas also supports the unary operator ~, which negates a boolean condition. We can use this to sample only rows that don't meet our condition.
Let's see what this would look like:
sample = df[~(df['bill_length_mm'] < 35)]
print(sample.head())
# Returns:
# species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
# 0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
# 1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
# 2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
# 3 Adelie Torgersen NaN NaN NaN NaN NaN
# 4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
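We can see here that only rows where the bill length is not less than 35 are returned. Note that this includes rows with a missing bill length, such as row 3, because a comparison against NaN evaluates to False before it is negated.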
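Two details are worth sketching here: how missing values behave under negation, and how a condition can be combined with .sample() to draw random rows from the filtered subset. The example uses a made-up dataframe (toy is a hypothetical name) with one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with one missing bill length
toy = pd.DataFrame({"bill_length_mm": [34.1, 39.5, np.nan, 33.5, 40.3, 36.7]})

is_short = toy["bill_length_mm"] < 35   # NaN < 35 evaluates to False

not_short = toy[~is_short]                       # keeps the NaN row
at_least_35 = toy[toy["bill_length_mm"] >= 35]   # drops the NaN row

print(len(not_short), len(at_least_35))  # 4 3

# Filter first, then draw a random row from the matching subset
random_short = toy[is_short].sample(n=1, random_state=0)
print(random_short)
```

If missing values should be excluded, an explicit comparison such as >= 35 is safer than negating the original condition.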
In the next section, you'll learn how to sample random columns from a Pandas Dataframe.
Pandas Sampling Random Columns
In this final section, you'll learn how to use Pandas to sample random columns of your dataframe. This can be done using the Pandas .sample() method, by setting the axis= parameter to 1, rather than relying on the default behavior of sampling rows (axis 0).
Let's see how we can do this using Pandas and Python:
sample = df.sample(n=3, axis=1)
print(sample.head())
# Returns:
# bill_depth_mm sex bill_length_mm
# 0 18.7 Male 39.1
# 1 17.4 Female 39.5
# 2 18.0 Female 40.3
# 3 NaN NaN NaN
# 4 19.3 Female 36.7
We can see here that we used Pandas to sample 3 random columns from our dataframe. In this case, all rows are returned but we limited the number of columns that we sampled.
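Column sampling accepts the same arguments as row sampling, so it can be made reproducible too. A minimal sketch on a made-up dataframe (toy is a hypothetical name), also showing that axis="columns" can be used in place of axis=1:

```python
import pandas as pd

# Hypothetical four-column dataframe used only for illustration
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})

# Sample two columns reproducibly; all rows are kept
cols = toy.sample(n=2, axis=1, random_state=1)

# axis="columns" is an alias for axis=1
same_cols = toy.sample(n=2, axis="columns", random_state=1)

print(cols.shape)  # (2, 2)
print(list(cols.columns) == list(same_cols.columns))
```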
Conclusion
In this post, you learned a number of different ways in which you can sample a Pandas Dataframe. You learned how to use the Pandas .sample() method, including how to return a set number of rows or a fraction of your dataframe. You also learned how to create reproducible samples, apply weights to your samples, sample with replacement, and select rows at a constant rate. Finally, you learned how to sample rows meeting a condition and how to select random columns.
To learn more about sampling, check out this post by Search Business Analytics.
To learn more about the Pandas sample method, check out the official documentation here.