pd.read_parquet: Read Parquet Files in Pandas

In this tutorial, you’ll learn how to use the Pandas read_parquet function to read parquet files in Pandas. While CSV files may be the ubiquitous file format for data analysts, they have limitations as your data size grows. This is where Apache Parquet files can help!

By the end of this tutorial, you’ll have learned:

  • What Apache Parquet files are
  • How to read parquet files with Pandas using the pd.read_parquet() function
  • How to specify which columns to read in a parquet file
  • How to speed up reading parquet files with PyArrow
  • How to specify the engine used to read a parquet file in Pandas

What are Apache Parquet Files?

The Apache Parquet format is a column-oriented data file format. This means data are stored by column, rather than by row. The benefits of this include significantly faster access to data, especially when querying only a subset of columns, because only the required columns need to be read, rather than entire records.

The format is an open-source format that is specifically designed for data storage and retrieval. Because of this, its encoding schema is designed for handling massive amounts of data, especially spread across different files.

Understanding the Pandas read_parquet Function

Before diving into using the Pandas read_parquet() function, let’s take a look at the different parameters and default arguments of the function. This will give you a strong understanding of the function’s abilities.

Let’s take a look at the pd.read_parquet() function:

# Understanding the Pandas read_parquet() Function
import pandas as pd

pd.read_parquet(
   path, 
   engine='auto', 
   columns=None, 
   storage_options=None, 
   use_nullable_dtypes=False
)

We can see that the function offers 5 parameters, 4 of which have default arguments provided. The table below breaks down the function’s parameters and provides descriptions of how they can be used.

Parameter              Description                                                          Default Argument   Possible Values
path=                  The path to the file, which can be a URL (such as S3 or FTP)         (required)         String or path object
engine=                Which parquet library to use                                         'auto'             {'auto', 'pyarrow', 'fastparquet'}
columns=               Which columns to read                                                None               List or None
storage_options=       Extra options for a particular storage connection, such as S3        None               Dict or None
use_nullable_dtypes=   If True, use data types that use pd.NA as the missing value          False              Boolean
                       indicator for the resulting DataFrame
Understanding the Pandas read_parquet() function

Now that you have a strong understanding of what options the function offers, let’s start learning how to read a parquet file using Pandas.

How to Read a Parquet File Using Pandas read_parquet

To read a Parquet file into a Pandas DataFrame, you can use the pd.read_parquet() function. The function allows you to load data from a variety of different sources. For the purposes of this tutorial, we’ve provided a sample Parquet file, which you can either download or load directly from GitHub using the code below.

# How to use Pandas read_parquet() To Read a Parquet File
import pandas as pd
url = 'https://github.com/datagy/mediumdata/raw/master/Sample.parquet'
df = pd.read_parquet(url)

print(df.head())

# Returns:
#       Name  Age  Gender
# 0     Jane   10  Female
# 1      Nik   35    Male
# 2     Kate   34  Female
# 3  Melissa   23  Female
# 4     Evan   70    Male

Let’s break down what we did in the code block above:

  1. We import Pandas using the conventional alias of pd
  2. We then loaded an external URL to a variable, url
  3. Finally, we used the pd.read_parquet() function to read the URL into a DataFrame

How to Specify Columns to Read in the Pandas read_parquet Function

One of the main advantages of using the Parquet format is that it is a column-oriented format. This means that when we load only a subset of columns, we can gain some efficiencies. The Pandas read_parquet() function allows us to specify which columns to read using the columns= parameter.

By default, the parameter is set to None, indicating that the function should read all columns. We can instead pass in a list of column labels to read only those columns.

Let’s see how we can use Pandas to read only a subset of columns when loading our Parquet file:

# Reading Only a Subset of Columns in Pandas
import pandas as pd
url = 'https://github.com/datagy/mediumdata/raw/master/Sample.parquet'
df = pd.read_parquet(url, columns=['Name', 'Age'])

print(df.head())

# Returns:
#       Name  Age
# 0     Jane   10
# 1      Nik   35
# 2     Kate   34
# 3  Melissa   23
# 4     Evan   70

In the code block above, we reused the code from the earlier example, adding the columns= parameter to specify that we wanted to read only two of the three columns. Because the data format is column-oriented, the skipped column is never read from disk, which can speed up loading significantly.

How to Specify the Engine Used in Pandas read_parquet

Because Parquet is an open-source format, there are many different libraries and engines that can be used to read and write the data. Pandas allows you to customize the engine used to read the data from the file if you know which library is best.

To specify the engine used when reading a Parquet file, you can use the engine= parameter. The parameter defaults to 'auto', which will first try the PyArrow engine. If this fails, then it will try to use the FastParquet library.

Some of the key differences between the two engines come down to their dependencies: the PyArrow engine is backed by the Arrow C++ library, while the fastparquet engine relies on Numba.

Let’s see how we can specify to use the PyArrow engine when reading a Parquet file in Pandas:

# Specifying the Engine When Reading Parquet Files in Pandas
import pandas as pd
url = 'https://github.com/datagy/mediumdata/raw/master/Sample.parquet'
df = pd.read_parquet(url, engine='pyarrow')

print(df.head())

# Returns:
#       Name  Age  Gender
# 0     Jane   10  Female
# 1      Nik   35    Male
# 2     Kate   34  Female
# 3  Melissa   23  Female
# 4     Evan   70    Male

In the following section, you’ll learn how to speed up reading Parquet files when using PyArrow and Pandas.

How to Speed Up Reading Parquet Files Using PyArrow in Python

When working with very large datasets, even Parquet files can load more slowly than anticipated. When using the Pandas read_parquet() function to load your data, the operation can often be sped up by bringing PyArrow into the mix directly.

We can use PyArrow’s read_table() function to load a Parquet file into an Arrow table and then convert it into a Pandas DataFrame, as shown below. read_table() can’t fetch a file over an HTTP URL, so you’ll need to download the data file and use it locally.

# Using PyArrow to Convert Parquet into a Pandas DataFrame
import pandas as pd
import pyarrow.parquet as pq

data = pq.read_table('Sample.parquet')
df = data.to_pandas()

print(df.head())

# Returns:
#       Name  Age  Gender
# 0     Jane   10  Female
# 1      Nik   35    Male
# 2     Kate   34  Female
# 3  Melissa   23  Female
# 4     Evan   70    Male

This method can work significantly faster than simply using Pandas to load a very large dataset. However, you won’t notice significant performance gains until you’re working with very large files.

Conclusion

In this tutorial, you learned how to use Pandas to read parquet files using the read_parquet() function. You first learned what Parquet files are and when you might encounter them. You then learned how to use the function to read a sample Parquet file. Next, you learned how to specify which columns to read and how to change the engine used to read files. Finally, you learned how to use Pandas and PyArrow together to speed up reading very large files.
