7 Ways to Sample Data in Pandas

In this post, you’ll learn a number of different ways to sample data in Pandas. Getting a sample of data can be incredibly useful when you’re working with large datasets, as it helps your analysis run more smoothly. If you sample your data representatively, you can work with a much smaller dataset, which lets your analysis run much faster while still giving you appropriate results.

In this post, we’ll explore a number of different ways in which you can get samples from your Pandas Dataframe. You’ll learn how to use Pandas to sample your dataframe, creating reproducible samples, weighted samples, and samples with replacement. You’ll also learn how to sample at a constant rate and sample items by conditions. Finally, you’ll learn how to sample only random columns.

The Quick Answer: Use Pandas .sample()

Loading our Sample Dataframe

For this tutorial, we’ll load a dataset that’s preloaded with Seaborn. If you want to learn more about loading datasets with Seaborn, check out my tutorial here. If you just want to follow along here, run the code below:

# Import Pandas and Seaborn's dataset loader
import pandas as pd
from seaborn import load_dataset

# Load the penguins dataset into a dataframe
df = load_dataset('penguins')

print(df.head())

This returns the following dataframe:

  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex
0  Adelie  Torgersen            39.1           18.7              181.0       3750.0    Male
1  Adelie  Torgersen            39.5           17.4              186.0       3800.0  Female
2  Adelie  Torgersen            40.3           18.0              195.0       3250.0  Female
3  Adelie  Torgersen             NaN            NaN                NaN          NaN     NaN
4  Adelie  Torgersen            36.7           19.3              193.0       3450.0  Female

In the code above, we first import Pandas as pd and then import the load_dataset() function from the Seaborn library. We then load the penguins dataset into our dataframe.

Using Pandas Sample to Sample your Dataframe

Pandas provides a very helpful method for, well, sampling data. The method is called using .sample() and provides a number of helpful parameters that we can apply. Before diving into some examples, let’s take a look at the method in a bit more detail:

DataFrame.sample(
    n=None, 
    frac=None, 
    replace=False, 
    weights=None, 
    random_state=None, 
    axis=None, 
    ignore_index=False
)

The parameters give us the following options:

  • n – the number of items to sample
  • frac – the proportion (out of 1) of items to return
  • replace – whether to sample with replacement (i.e., items can be sampled more than once)
  • weights – by default, all items are equally likely to be sampled. A column name or a series of weights can be passed instead. If the weights do not add up to 1, they will be normalized so that they do.
  • random_state – a seed number to produce reproducible results
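  • axis – the axis to sample: 0 for rows (the default for dataframes) or 1 for columns
  • ignore_index – if True, the sampled items are relabeled 0, 1, 2, and so on, rather than keeping their original index labels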
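To make the difference between n= and frac= concrete, here is a minimal sketch. It uses a small hypothetical dataframe called demo (an assumption for illustration, not the penguins dataset) so that the row counts are easy to verify by hand:

```python
import pandas as pd

# Hypothetical stand-in dataframe with 20 rows (not the penguins dataset)
demo = pd.DataFrame({"id": range(20), "value": [i * 1.5 for i in range(20)]})

# n= asks for an exact number of rows
by_count = demo.sample(n=4)

# frac= asks for a proportion of the rows: 0.25 * 20 = 5 rows
by_fraction = demo.sample(frac=0.25)

print(len(by_count), len(by_fraction))

# n= and frac= are mutually exclusive, so passing both raises a ValueError
try:
    demo.sample(n=4, frac=0.25)
except ValueError as error:
    print("ValueError:", error)
```

In practice, n= is convenient when you need a fixed sample size, while frac= scales with the size of the dataframe.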

Let’s take a look at an example. We’ll pull 5% of our records, by passing in frac=0.05 as an argument:

sample = df.sample(frac=0.05)
print(sample)

# Returns:
#        species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex
# 23      Adelie     Biscoe            38.2           18.1              185.0       3950.0    Male
# 91      Adelie      Dream            41.1           18.1              205.0       4300.0    Male
# 190  Chinstrap      Dream            46.9           16.6              192.0       2700.0  Female
# 321     Gentoo     Biscoe            55.9           17.0              228.0       5600.0    Male
# 198  Chinstrap      Dream            50.1           17.9              190.0       3400.0  Female
# 170  Chinstrap      Dream            46.4           18.6              190.0       3450.0  Female
# 232     Gentoo     Biscoe            45.5           13.7              214.0       4650.0  Female
# 136     Adelie      Dream            35.6           17.5              191.0       3175.0  Female
# 179  Chinstrap      Dream            49.5           19.0              200.0       3800.0    Male
# 11      Adelie  Torgersen            37.8           17.3              180.0       3700.0     NaN
# 86      Adelie      Dream            36.3           19.5              190.0       3800.0    Male
# 249     Gentoo     Biscoe            50.0           15.3              220.0       5550.0    Male
# 205  Chinstrap      Dream            50.7           19.7              203.0       4050.0    Male
# 92      Adelie      Dream            34.0           17.1              185.0       3400.0  Female
# 286     Gentoo     Biscoe            46.2           14.4              214.0       4650.0     NaN
# 108     Adelie     Biscoe            38.1           17.0              181.0       3175.0  Female
# 299     Gentoo     Biscoe            45.2           16.4              223.0       5950.0    Male

We can see here that 5% of the dataframe is sampled: 17 of the 344 rows. The first column shows each row’s index label from the original dataframe, and we can see that the rows were picked at random.

Tip: If you don’t want to keep the original index, simply pass in the ignore_index=True argument, which relabels the sampled rows starting from 0.
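As a rough illustration of that tip, the sketch below samples the same rows twice, once keeping the original labels and once with ignore_index=True. The demo dataframe is a hypothetical stand-in, and the ignore_index parameter assumes Pandas 1.3 or later:

```python
import pandas as pd

# Hypothetical stand-in dataframe; the id values start at 100 so they
# are easy to tell apart from the index labels
demo = pd.DataFrame({"id": range(100, 120)})

# Same seed, so both calls pick the same rows
kept = demo.sample(n=5, random_state=0)
relabeled = demo.sample(n=5, random_state=0, ignore_index=True)

print(kept.index.tolist())       # original index labels, in sampled order
print(relabeled.index.tolist())  # [0, 1, 2, 3, 4]
```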

In the next section, you’ll learn how to use Pandas to create a reproducible sample of your data.

Creating a Reproducible Random Sample in Pandas

In your data science journey, you’ll run into many situations where you need to be able to reproduce the results of your analysis. Because of this, when you sample data using Pandas, it can be very helpful to know how to create reproducible results.

In many data science libraries, you’ll find either a seed or random_state argument. In the case of the .sample() method, the argument that allows you to create reproducible results is the random_state= argument.

In order to make this work, let’s pass in an integer to make our result reproducible. Let’s give this a shot using Python:

# Create a reproducible sample using random_state
sample = df.sample(n = 5, random_state = 1)
print(sample)

# Returns:
#     species  island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex
# 306  Gentoo  Biscoe            43.4           14.4              218.0       4600.0  Female
# 341  Gentoo  Biscoe            50.4           15.7              222.0       5750.0    Male
# 291  Gentoo  Biscoe            46.4           15.6              221.0       5000.0    Male
# 102  Adelie  Biscoe            37.7           16.0              183.0       3075.0  Female
# 289  Gentoo  Biscoe            50.7           15.0              223.0       5550.0    Male

sample2 = df.sample(n = 5, random_state = 1)
print(sample2)
# Returns:
#     species  island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex
# 306  Gentoo  Biscoe            43.4           14.4              218.0       4600.0  Female
# 341  Gentoo  Biscoe            50.4           15.7              222.0       5750.0    Male
# 291  Gentoo  Biscoe            46.4           15.6              221.0       5000.0    Male
# 102  Adelie  Biscoe            37.7           16.0              183.0       3075.0  Female
# 289  Gentoo  Biscoe            50.7           15.0              223.0       5550.0    Male

We can see here that passing the same value into the random_state= argument returns the same result.

This allows us to produce a sample one day and recreate the exact same sample another day, making our results and analysis much more reproducible.
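Rather than comparing printed output by eye, you can also check reproducibility programmatically. This is a minimal sketch using a hypothetical demo dataframe and the Pandas .equals() method:

```python
import pandas as pd

# Hypothetical stand-in dataframe (not the penguins dataset)
demo = pd.DataFrame({"id": range(50), "score": [i % 7 for i in range(50)]})

# Two samples drawn with the same seed
first = demo.sample(n=5, random_state=1)
second = demo.sample(n=5, random_state=1)

# .equals() compares values and index labels
print(first.equals(second))  # True
```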

In the next section, you’ll learn how to apply weights to the samples of your Pandas Dataframe.

Pandas Weighted Samples

One of the very powerful features of the Pandas .sample() method is the ability to apply different weights to certain rows, meaning that some rows will have a higher chance of being selected than others.

To get started with this example, let’s take a look at the types of penguins we have in our dataset:

print(df['species'].unique())

# Returns: ['Adelie' 'Chinstrap' 'Gentoo']

Say we wanted to give the Chinstrap species a higher chance of being selected. We could apply weights to these species in another column, using the Pandas .map() method. To learn more about the .map() method, check out my in-depth tutorial on mapping values to another column here.

df['weights'] = df['species'].map({'Adelie': 20, 'Chinstrap': 60, 'Gentoo': 20})
sample = df.sample(n=5, weights='weights')
print(sample)

# Returns:
#        species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex  weights
# 163  Chinstrap      Dream            51.7           20.3              194.0       3775.0    Male       60
# 223     Gentoo     Biscoe            50.0           15.2              218.0       5700.0    Male       20
# 118     Adelie  Torgersen            35.7           17.0              189.0       3350.0  Female       20
# 217  Chinstrap      Dream            49.6           18.2              193.0       3775.0    Male       60
# 160  Chinstrap      Dream            46.0           18.9              195.0       4150.0  Female       60

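We can see here that the Chinstrap species makes up three of the five sampled rows. Because weights apply to individual rows, each Chinstrap row is three times as likely to be drawn as any single Adelie or Gentoo row.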

Let’s break down what we’ve done here:

  1. We created a new weights column by mapping a dictionary of weights onto the species column, using the Pandas .map() method
  2. We then passed our new column into the weights argument as: weights='weights', which instructed Pandas to use the column to assign weights

Some important things to understand about the weights= argument:

  • The weights do not need to add up to 1
  • If the values do not add up to 1, then Pandas will normalize them so that they do.
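The normalization described above can be sketched by hand. The example below uses a small hypothetical dataframe and, as an assumption made purely for illustration, gives Gentoo a weight of 0 so that it can never be drawn. The weights are passed as a series rather than a column name:

```python
import pandas as pd

# Hypothetical stand-in dataframe with four rows per species
demo = pd.DataFrame({
    "species": ["Adelie"] * 4 + ["Chinstrap"] * 4 + ["Gentoo"] * 4,
})

# Raw weights do not add up to 1; Gentoo gets 0 for illustration
raw_weights = demo["species"].map({"Adelie": 20, "Chinstrap": 60, "Gentoo": 0})

# Dividing by the total gives the per-row probabilities that the
# normalization produces
probabilities = raw_weights / raw_weights.sum()
print(probabilities.groupby(demo["species"]).sum())

# Rows with a weight of 0 are never selected
sample = demo.sample(n=6, weights=raw_weights, random_state=0)
print(sample["species"].value_counts())
```

Here the Adelie rows share 25% of the total probability and the Chinstrap rows share 75%, even though the raw weights were 20 and 60.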

In the next section, you’ll learn how to sample a dataframe with replacement, meaning that items can be chosen more than a single time.

Pandas Sample with Replacements

Another helpful feature of the Pandas .sample() method is the ability to sample with replacement, meaning that an item can be sampled more than a single time.

For this, we can use the boolean argument, replace=. By default, this is set to False, meaning that items cannot be sampled more than a single time. By setting it to True, however, the items are placed back into the sampling pile, allowing us to draw them again.

In order to demonstrate this, let’s work with a much smaller dataframe. We’ll first sample our dataframe down to only five rows, so that we can see how often each row is sampled:

small_df = df.sample(n=5)
sample = small_df.sample(n=5, replace=True)
print(sample)

# Returns:
#        species  island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex
# 162  Chinstrap   Dream            46.6           17.8              193.0       3800.0  Female
# 341     Gentoo  Biscoe            50.4           15.7              222.0       5750.0    Male
# 162  Chinstrap   Dream            46.6           17.8              193.0       3800.0  Female
# 45      Adelie   Dream            39.6           18.8              190.0       4600.0    Male
# 59      Adelie  Biscoe            37.6           19.1              194.0       3750.0    Male

Let’s break down what we’ve done here:

  1. We first created small_df, which contained only five rows from our original dataframe
  2. We then re-sampled small_df to return five records. Without replacement, this would return each of the five records exactly once. However, since we passed in replace=True, Pandas was able to select each record more than once.
  3. Because of this, the record 162 was returned twice.

One interesting thing to note is that sampling with replacement can return a sample that is larger than the original dataset. For example, if we were to set the frac= argument to 1.2, we would also need to set replace=True, since we’d be asking for 120% of the original records.
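The sketch below illustrates that upsampling case with a hypothetical five-row dataframe. Asking for 120% of the rows without replacement raises a ValueError, while adding replace=True returns six rows:

```python
import pandas as pd

# Hypothetical stand-in dataframe with five rows
small = pd.DataFrame({"id": range(5)})

# Without replacement, a fraction above 1 cannot be satisfied
try:
    small.sample(frac=1.2)
except ValueError as error:
    print("ValueError:", error)

# With replacement, 1.2 * 5 = 6 rows are drawn
upsampled = small.sample(frac=1.2, replace=True, random_state=0)
print(len(upsampled))  # 6

# Six draws from five rows means at least one row must repeat
print(upsampled["id"].duplicated().any())  # True
```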

In the next section, you’ll learn how to sample at a constant rate.

Pandas Sampling Every nth Item (Sampling at a constant rate)

A popular sampling technique is to sample every nth item, meaning that you’re sampling at a constant rate.

In order to do this, we can use the incredibly useful Pandas .iloc accessor, which allows us to access items using slice notation. To learn more about .iloc to select data, check out my tutorial here.

In Python, we can slice data in different ways using slice notation, which follows this pattern:

[start : end : step]

If we wanted to, say, select every 5th record, we could leave the start and end parameters empty (meaning they’d slice from beginning to end) and step over every 5 records.

Let’s see what this would look like:

# Select every fifth row by position, from the first row to the last
sample = df.iloc[::5]
print(sample.head())

# Returns:
#    species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex
# 0   Adelie  Torgersen            39.1           18.7              181.0       3750.0    Male
# 5   Adelie  Torgersen            39.3           20.6              190.0       3650.0    Male
# 10  Adelie  Torgersen            37.8           17.1              186.0       3300.0     NaN
# 15  Adelie  Torgersen            36.6           17.8              185.0       3700.0  Female
# 20  Adelie     Biscoe            37.8           18.3              174.0       3400.0  Female

Taking a look at the index of our sample dataframe, we can see that it returns every fifth row. We can set the step to whatever rate we want.
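The start position can be changed as well, which is useful if you don’t always want to begin at the first row. A minimal sketch, using a hypothetical 20-row dataframe so the selected positions are easy to check:

```python
import pandas as pd

# Hypothetical stand-in dataframe with 20 rows
demo = pd.DataFrame({"id": range(20)})

# Every fifth row, starting from the first row
every_fifth = demo.iloc[::5]
print(every_fifth.index.tolist())  # [0, 5, 10, 15]

# Every fifth row, starting from the third row (position 2)
offset = demo.iloc[2::5]
print(offset.index.tolist())  # [2, 7, 12, 17]
```

Keep in mind that this kind of sampling depends on row order, so a sorted dataframe can give a sample that is not representative.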

In the next section, you’ll learn how to use Pandas to sample items by a given condition.

Pandas Sampling Items by Conditions

You may also want to sample a Pandas Dataframe using a condition, meaning that you can return all rows that meet (or don’t meet) a certain condition. In order to filter our dataframe using conditions, we use the [] square bracket indexing method, where we pass a condition into the square brackets.

If you want to learn more about how to select items based on conditions, check out my tutorial on selecting data in Pandas.

Say we wanted to filter our dataframe to select only rows where bill_length_mm is less than 35.

We can write the following:

condition = df['bill_length_mm'] < 35
sample = df[condition]

print(sample.head())

# Returns:
#    species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex
# 8   Adelie  Torgersen            34.1           18.1              193.0       3475.0     NaN
# 14  Adelie  Torgersen            34.6           21.1              198.0       4400.0    Male
# 18  Adelie  Torgersen            34.4           18.4              184.0       3325.0  Female
# 54  Adelie     Biscoe            34.5           18.1              187.0       2900.0  Female
# 70  Adelie  Torgersen            33.5           19.0              190.0       3600.0  Female

We can see here that we returned only rows where the bill length was less than 35.

Rather than splitting the condition off onto a separate line, we could also simply combine it to be written as sample = df[df['bill_length_mm'] < 35] to make our code more concise.
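A condition can also be combined with .sample() when you want a random subset of the matching rows rather than all of them. The following is a sketch under that assumption, using a hypothetical dataframe with made-up bill lengths:

```python
import pandas as pd

# Hypothetical stand-in dataframe with a few bill lengths
demo = pd.DataFrame(
    {"bill_length_mm": [39.1, 34.1, 36.7, 34.6, 33.5, 40.3, 34.4]}
)

# Filter first, then draw two random rows from the matches
short_bills = demo[demo["bill_length_mm"] < 35]
sample = short_bills.sample(n=2, random_state=0)

print(sample)
```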

Pandas also supports the unary operator ~, which negates a boolean condition. We can use this to sample only rows that don’t meet our condition.

Let’s see what this would look like:

sample = df[~(df['bill_length_mm'] < 35)]

print(sample.head())

# Returns:
#   species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex
# 0  Adelie  Torgersen            39.1           18.7              181.0       3750.0    Male
# 1  Adelie  Torgersen            39.5           17.4              186.0       3800.0  Female
# 2  Adelie  Torgersen            40.3           18.0              195.0       3250.0  Female
# 3  Adelie  Torgersen             NaN            NaN                NaN          NaN     NaN
# 4  Adelie  Torgersen            36.7           19.3              193.0       3450.0  Female

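We can see here that only rows where the bill length is not less than 35 are returned. Note that this is not quite the same as filtering for values of 35 or greater: rows with a missing bill length, such as row 3, are also returned, because a comparison against NaN evaluates to False and negating it gives True.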
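To see how missing values behave under negation, here is a small sketch with a hypothetical dataframe that contains one NaN. It compares the negated condition against an explicit >= comparison:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in dataframe with one missing bill length
demo = pd.DataFrame({"bill_length_mm": [39.1, 34.1, np.nan, 35.0, 33.5]})

# Negating "< 35" keeps the NaN row, since NaN < 35 is False
negated = demo[~(demo["bill_length_mm"] < 35)]
print(negated.index.tolist())  # [0, 2, 3]

# An explicit ">= 35" drops the NaN row, since NaN >= 35 is also False
at_least = demo[demo["bill_length_mm"] >= 35]
print(at_least.index.tolist())  # [0, 3]
```

If you want to exclude missing values from the negated result, the explicit comparison is the safer choice.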

In the next section, you’ll learn how to sample random columns from a Pandas Dataframe.

Pandas Sampling Random Columns

In this final section, you’ll learn how to use Pandas to sample random columns of your dataframe. This can be done using the Pandas .sample() method, by setting the axis= parameter to 1, rather than the default value of 0.

Let’s see how we can do this using Pandas and Python:

sample = df.sample(n=3,axis=1)
print(sample.head())

# Returns:
#    bill_depth_mm     sex  bill_length_mm
# 0           18.7    Male            39.1
# 1           17.4  Female            39.5
# 2           18.0  Female            40.3
# 3            NaN     NaN             NaN
# 4           19.3  Female            36.7

We can see here that we used Pandas to sample 3 random columns from our dataframe. In this case, all rows are returned but we limited the number of columns that we sampled.
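Column sampling accepts the same parameters as row sampling, so it can be made reproducible with random_state=. A minimal sketch with a hypothetical four-column dataframe:

```python
import pandas as pd

# Hypothetical stand-in dataframe with four columns
demo = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})

# Sample two columns; the seed makes the choice repeatable
columns_sample = demo.sample(n=2, axis=1, random_state=0)
repeat_sample = demo.sample(n=2, axis=1, random_state=0)

print(columns_sample.columns.tolist())
print(columns_sample.shape)  # (2, 2): all rows, two columns
```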

Conclusion

In this post, you learned several different ways in which you can sample a Pandas Dataframe. You learned how to use the Pandas .sample() method, including how to return a set number of rows or a fraction of your dataframe, how to create reproducible samples, how to apply weights to your samples, and how to sample with replacement. You also learned how to select rows at a constant rate, how to sample rows meeting a condition, and how to select random columns.

To learn more about sampling, check out this post by Search Business Analytics.

To learn more about the Pandas sample method, check out the official documentation here.