Pickle files are a common storage format for trained machine-learning models. Being able to dive into these with Pandas and explore the data structures can be instrumental in evaluating your data science models.
In this tutorial, you’ll learn how to read pickle files into Pandas DataFrames. The function provides a simple interface to read pickle files, meaning there isn’t a ton of flexibility. That said, it provides enough flexibility to read your files effectively.
By the end of this tutorial, you’ll have learned the following:
- How to use the pd.read_pickle() function to read serialized files in Pandas
- What the motivation is for using pickle files in machine learning
- How to specify the compression format and specific storage options for working with different providers such as Amazon S3
Understanding the Pandas read_pickle Function
The Pandas read_pickle function is a relatively simple function for reading data, especially when compared to more exhaustive functions such as the Pandas read_excel function. Let’s take a look at the function and its different parameters:
# Understanding the Pandas read_pickle() Function
import pandas as pd
pd.read_pickle(filepath_or_buffer, compression='infer', storage_options=None)
We can see that the function provides three parameters, only one of which is required:
- filepath_or_buffer= represents the string pointing to where the pickle file is saved
- compression= represents the compression format of the file
- storage_options= allows you to pass in additional information for different storage providers
We can see that the function is relatively simple, which can seem like a blessing compared to more customizable functions such as the Pandas read_csv function, which offers a ton of different parameters.
The Motivation for Using Pickle Files in Machine Learning
Pickle files are commonplace in machine learning because they allow you to serialize and deserialize Python objects. Serialization converts an object into a byte stream, which lets you maintain program state across sessions or transport data more easily, such as to a database.
This is especially important when working with complex data that can’t easily be saved to normal data formats. Pandas also provides a helpful way to save to pickle files, using the Pandas to_pickle method.
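To make the round trip concrete, the short sketch below (using a hypothetical file name, example.pkl) saves a DataFrame with the to_pickle method and restores it with read_pickle:

```python
# Saving and Restoring a DataFrame with Pickle
import pandas as pd

df = pd.DataFrame({'Name': ['Nik', 'Katie'], 'Age': [34, 33]})
df.to_pickle('example.pkl')               # Serialize the DataFrame to disk

restored = pd.read_pickle('example.pkl')  # Deserialize it back
print(restored.equals(df))

# Returns: True
```

Because the object is serialized as-is, the restored DataFrame keeps its data types and index exactly, which is a key advantage over text formats such as CSV.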
Reading a Pickle File into a Pandas DataFrame
When you have a simple pickle file, one with the extension .pkl, you can pass the path to the file into the pd.read_pickle() function. The function accepts local files, URLs, and even more advanced storage options, such as those covered later in this tutorial.
Let’s see how we can pass the path to a file into the read_pickle() function to read the data as a Pandas DataFrame:
# Loading a Pickle File to a Pandas DataFrame
import pandas as pd
df = pd.read_pickle('pickle.pkl')
print(df.head())
# Returns:
# Name Age Location
# 0 Nik 34 Toronto
# 1 Katie 33 NYC
# 2 Evan 27 Atlanta
In the code block above, we imported the Pandas library and then passed the path to a file into the read_pickle() function. We then printed out the first records of the DataFrame by using the .head() method.
In the following section, you’ll learn how to work with compressed pickle files.
Specifying the Compression Format When Reading a Pickle File with Pandas
Pandas can also read compressed pickle files. By default, these files will have a different extension matching their compression format. For example, a pickle file with gzip compression will typically end with the extension .gz, such as data.pkl.gz.
Pandas, by default, will infer the compression type by looking at the extension of the file. However, if you want to be sure Pandas uses the right compression, you can pass a string representing the compression into the compression= parameter.
# Loading a Pickle File to a Pandas DataFrame with Compression
import pandas as pd
df = pd.read_pickle('pickle.pkl.gz', compression='gzip')
print(df.head())
# Returns:
# Name Age Location
# 0 Nik 34 Toronto
# 1 Katie 33 NYC
# 2 Evan 27 Atlanta
The example above also works if we omit the compression= parameter, as long as the file uses an extension Pandas recognizes (such as .gz), since the parameter defaults to compression='infer'.
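To see this inference in action, the sketch below (using a hypothetical example.pkl.gz path) writes a gzip-compressed pickle with to_pickle, then reads it back without specifying compression=:

```python
# Demonstrating Compression Inference from the File Extension
import pandas as pd

df = pd.DataFrame({'Name': ['Nik', 'Katie', 'Evan'], 'Age': [34, 33, 27]})
df.to_pickle('example.pkl.gz')               # gzip inferred from the .gz suffix

restored = pd.read_pickle('example.pkl.gz')  # compression='infer' is the default
print(restored.equals(df))

# Returns: True
```

The same inference applies to other supported suffixes, such as .bz2, .zip, and .xz.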
In the final section below, you’ll learn how to specify different storage options when reading pickle files.
Specifying Storage Options When Reading Pickle Files in Pandas
When working with larger machine learning models, you may also be working with more complex storage options, such as Amazon S3 or Google Cloud. Pandas allows you to read these files directly by using the storage_options= parameter. The parameter accepts a dictionary of the required information.
The example below shows a simple example of how to connect to an Amazon S3 storage account:
# Loading a Pickle File to a Pandas DataFrame from S3 Storage
import pandas as pd
AWS_S3_BUCKET = ''
AWS_ACCESS_KEY_ID = ''
AWS_SECRET_ACCESS_KEY = ''
AWS_SESSION_TOKEN = ''
key = ''
df = pd.read_pickle(
f"s3://{AWS_S3_BUCKET}/{key}",
storage_options={
"key": AWS_ACCESS_KEY_ID,
"secret": AWS_SECRET_ACCESS_KEY,
"token": AWS_SESSION_TOKEN,
}
)
The parameters you need to pass in will vary by the service provider and your configuration. The example above shows a simple configuration.
Conclusion
In this tutorial, you learned how to use the Pandas read_pickle function to read pickle files. You first learned about the different parameters of the function. Then, you learned about the motivations behind using pickle files, especially in the realm of data science. From there, you learned how to use the function to read pickle files, as well as compressed pickle files. Finally, you learned how to read pickle files stored on other storage providers such as Amazon S3.