Pickle files are a common storage format for trained machine-learning models. Being able to dive into these with Pandas and explore the data structures can be instrumental in evaluating your data science models.
In this tutorial, you’ll learn how to read pickle files into Pandas DataFrames. The function provides a simple interface to read pickle files, meaning there isn’t a ton of flexibility. That said, it provides enough flexibility to read your files effectively.
By the end of this tutorial, you’ll have learned the following:
- How to use the pd.read_pickle() function to read serialized files in Pandas
- What the motivation is for using pickle files in machine learning
- How to specify the compression format and specific storage options for working with different providers such as Amazon S3
Understanding the Pandas read_pickle Function
The Pandas read_pickle function is a relatively simple function for reading data, especially when compared to more exhaustive functions such as the Pandas read_excel function. Let’s take a look at the function and its different parameters:
# Understanding the Pandas read_pickle() Function
import pandas as pd
pd.read_pickle(filepath_or_buffer, compression='infer', storage_options=None)
We can see that the function provides three parameters, only one of which is required:
- filepath_or_buffer= represents the string pointing to where the pickle file is saved
- compression= represents the compression format of the file
- storage_options= allows you to pass in additional information for different storage providers
We can see that the function is relatively simple, which can seem like a blessing compared to more customizable functions such as the Pandas read_csv function, which offers a ton of different parameters.
The Motivation for Using Pickle Files in Machine Learning
Pickle files are commonplace in machine learning because they allow you to serialize and deserialize Python objects. Serialization converts an object into a byte stream, which lets you maintain program state across sessions or transport data more easily, such as to a database.
This is especially important when working with complex data that can’t easily be saved to normal data formats. Pandas also provides a helpful way to save to pickle files, using the Pandas to_pickle method.
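To make the round trip concrete, the short sketch below (using a hypothetical file name, example.pkl) saves a DataFrame with the to_pickle method and restores it with read_pickle:

```python
# Saving and Restoring a DataFrame with Pickle
import pandas as pd

df = pd.DataFrame({'Name': ['Nik', 'Katie'], 'Age': [34, 33]})
df.to_pickle('example.pkl')               # Serialize the DataFrame to disk

restored = pd.read_pickle('example.pkl')  # Deserialize it back
print(restored.equals(df))

# Returns: True
```

Because the object is serialized as-is, the restored DataFrame keeps its data types and index exactly, which is a key advantage over text formats such as CSV.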
Reading a Pickle File into a Pandas DataFrame
When you have a simple pickle file, one with the extension .pkl, you can pass the path to the file into the pd.read_pickle() function. The function accepts local files, URLs, and even more advanced storage options, such as those covered later in this tutorial.
Let’s see how we can pass the path to a file into the read_pickle() function to read the data as a Pandas DataFrame:
# Loading a Pickle File to a Pandas DataFrame
import pandas as pd
df = pd.read_pickle('pickle.pkl')
print(df.head())
# Returns:
# Name Age Location
# 0 Nik 34 Toronto
# 1 Katie 33 NYC
# 2 Evan 27 Atlanta
In the code block above, we imported the Pandas library and then passed the path to a file into the read_pickle() function. We then printed out the first records of the DataFrame by using the .head() method.
In the following section, you’ll learn how to work with compressed pickle files.
Specifying the Compression Format When Reading a Pickle File with Pandas
Pandas can also read compressed pickle files. By default, these files will have a different extension matching their compression format. For example, a pickle file with gzip compression will typically end with the extension .gz, such as data.pkl.gz.
Pandas, by default, will infer the compression type by looking at the extension of the file. However, if you want to be sure Pandas uses the right compression, you can pass a string representing the compression into the compression= parameter.
# Loading a Pickle File to a Pandas DataFrame with Compression
import pandas as pd
df = pd.read_pickle('pickle.pkl.gz', compression='gzip')
print(df.head())
# Returns:
# Name Age Location
# 0 Nik 34 Toronto
# 1 Katie 33 NYC
# 2 Evan 27 Atlanta
The example above also works if we omit the compression= parameter, as long as the file uses an extension Pandas recognizes (such as .gz), since the parameter defaults to compression='infer'.
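To see this inference in action, the sketch below (using a hypothetical example.pkl.gz path) writes a gzip-compressed pickle with to_pickle, then reads it back without specifying compression=:

```python
# Demonstrating Compression Inference from the File Extension
import pandas as pd

df = pd.DataFrame({'Name': ['Nik', 'Katie', 'Evan'], 'Age': [34, 33, 27]})
df.to_pickle('example.pkl.gz')               # gzip inferred from the .gz suffix

restored = pd.read_pickle('example.pkl.gz')  # compression='infer' is the default
print(restored.equals(df))

# Returns: True
```

The same inference applies to other supported suffixes, such as .bz2, .zip, and .xz.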
In the final section below, you’ll learn how to specify different storage options when reading pickle files.
Specifying Storage Options When Reading Pickle Files in Pandas
When working with larger machine learning models, you may also be working with more complex storage options, such as Amazon S3 or Google Cloud. Pandas allows you to read these files directly by using the storage_options= parameter. The parameter accepts a dictionary of the required information.
The example below shows a simple example of how to connect to an Amazon S3 storage account:
# Loading a Pickle File to a Pandas DataFrame from S3 Storage
import pandas as pd
AWS_S3_BUCKET = ''
AWS_ACCESS_KEY_ID = ''
AWS_SECRET_ACCESS_KEY = ''
AWS_SESSION_TOKEN = ''
key = ''
df = pd.read_pickle(
f"s3://{AWS_S3_BUCKET}/{key}",
storage_options={
"key": AWS_ACCESS_KEY_ID,
"secret": AWS_SECRET_ACCESS_KEY,
"token": AWS_SESSION_TOKEN,
}
)
The parameters you need to pass in will vary by the service provider and your configuration. The example above shows a simple configuration.
Conclusion
In this tutorial, you learned how to use the Pandas read_pickle function to read pickle files. You first learned about the different parameters of the function. Then, you learned about the motivations behind using pickle files, especially in the realm of data science. From there, you learned how to use the function to read pickle files, as well as compressed pickle files. Finally, you learned how to read pickle files stored on other storage providers such as Amazon S3.