Pandas read_sql: Reading SQL into DataFrames

In this tutorial, you’ll learn how to read SQL tables or queries into a Pandas DataFrame. Given how prevalent SQL is in industry, it’s important to understand how to read SQL into a Pandas DataFrame.

By the end of this tutorial, you’ll have learned the following:

  • How to use the pd.read_sql() function (and the other Pandas functions for reading SQL)
  • How to read a SQL table or query into a Pandas DataFrame
  • How to customize the function’s behavior to set index columns, parse dates, and improve performance by reading the data in chunks

Understanding Functions to Read SQL into Pandas DataFrames

Pandas provides three different functions to read SQL into a DataFrame:

  1. pd.read_sql() – which is a convenience wrapper for the two functions below
  2. pd.read_sql_table() – which reads a table in a SQL database into a DataFrame
  3. pd.read_sql_query() – which reads a SQL query into a DataFrame

Due to its versatility, we’ll focus our attention on the pd.read_sql() function, which can be used to read both tables and queries.

Let’s take a look at the function’s parameters and default arguments:

# Understanding the Pandas read_sql() Function
import pandas as pd
pd.read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=None)

We can see that we need to provide two arguments:

  1. sql= is used to provide either a SQL table or a SQL query
  2. con= provides a connection to the SQL database

Let’s start off learning how to use the function by first loading a sample sqlite database.

Loading a Sample SQL Database

In the code block below, we provide code for creating a custom SQL database. If, instead, you’re working with your own database, feel free to use that, though your results will of course vary. We’re using sqlite here to simplify creating the database:

import sqlite3

# Connect to Database
conn = sqlite3.connect('users') 
cur = conn.cursor()

# Create Table
cur.execute("""CREATE TABLE IF NOT EXISTS users(
   userid INT PRIMARY KEY,
   fname TEXT,
   company TEXT,
   gender TEXT,
   date DATE,
   amount FLOAT);
""")
conn.commit()

# Add User Information
users = [
    ('00001', 'Nik', 'datagy', 'male', '2023-06-01', 12.34), 
    ('00002', 'Lois', 'Daily Planet', 'Female', '2023-07-01', 12.56), 
    ('00003', 'Peter', 'Parker Tech', 'Male', '2023-08-01', 45.67), 
    ('00004', 'Bruce', 'Wayne Enterprises', 'male', '2023-09-01', 123.12)
    ]

cur.executemany("INSERT INTO users VALUES(?, ?, ?, ?, ?, ?);", users)
conn.commit()

In the code block above, we added four records to our database users. Let’s now see how we can load data from our SQL database in Pandas.
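Before moving on to Pandas, it can be worth a quick sanity check that the records were actually written. The sketch below rebuilds the same table in an in-memory database (rather than the 'users' file used above) so that it runs on its own, then counts the rows with a plain cursor query:

```python
import sqlite3

# Rebuild the tutorial's users table in an in-memory database
conn = sqlite3.connect(':memory:')
conn.execute("""CREATE TABLE users(
   userid INT PRIMARY KEY,
   fname TEXT,
   company TEXT,
   gender TEXT,
   date DATE,
   amount FLOAT);""")
conn.executemany(
    "INSERT INTO users VALUES(?, ?, ?, ?, ?, ?);",
    [('00001', 'Nik', 'datagy', 'male', '2023-06-01', 12.34),
     ('00002', 'Lois', 'Daily Planet', 'Female', '2023-07-01', 12.56),
     ('00003', 'Peter', 'Parker Tech', 'Male', '2023-08-01', 45.67),
     ('00004', 'Bruce', 'Wayne Enterprises', 'male', '2023-09-01', 123.12)])
conn.commit()

# Count the rows we just inserted
count = conn.execute("SELECT COUNT(*) FROM users;").fetchone()[0]
print(count)

# Returns: 4
```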

How to Read SQL into a Pandas DataFrame

In order to read a SQL table or query into a Pandas DataFrame, you can use the pd.read_sql() function.

The function depends on you having a declared connection to a SQL database. While we won’t go into how to connect to every database, we’ll continue to follow along with our sqlite example. In order to connect to the unprotected database, we can simply declare a connection variable using conn = sqlite3.connect('users').

Let’s take a look at how we can query all records from a table into a DataFrame:

# Reading a SQL Table Into Pandas DataFrame
import pandas as pd
import sqlite3
conn = sqlite3.connect('users')

df = pd.read_sql(sql="SELECT * FROM users", con=conn)
print(df.head())

# Returns:
#    userid  fname            company  gender        date  amount
# 0       1    Nik             datagy    male  2023-06-01   12.34
# 1       2   Lois       Daily Planet  Female  2023-07-01   12.56
# 2       3  Peter        Parker Tech    Male  2023-08-01   45.67
# 3       4  Bruce  Wayne Enterprises    male  2023-09-01  123.12

In the code block above, we loaded a Pandas DataFrame using the pd.read_sql() function. The function only has two required parameters:

  1. The SQL query, passed into the sql= parameter, and
  2. The connection to the database, passed into the con= parameter

In the code block, we connected to our SQL database using sqlite. Then, we asked Pandas to query the entirety of the users table. This returned the table shown above.
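As an aside, the params= parameter from the function signature above lets you pass values into a parameterized query safely, rather than formatting them into the SQL string yourself (which risks SQL injection). A minimal sketch, rebuilding the users table in memory so the example is self-contained:

```python
import sqlite3
import pandas as pd

# Rebuild the tutorial's users table in an in-memory database
conn = sqlite3.connect(':memory:')
conn.execute("""CREATE TABLE users(
   userid INT PRIMARY KEY, fname TEXT, company TEXT,
   gender TEXT, date DATE, amount FLOAT);""")
conn.executemany(
    "INSERT INTO users VALUES(?, ?, ?, ?, ?, ?);",
    [('00001', 'Nik', 'datagy', 'male', '2023-06-01', 12.34),
     ('00002', 'Lois', 'Daily Planet', 'Female', '2023-07-01', 12.56),
     ('00003', 'Peter', 'Parker Tech', 'Male', '2023-08-01', 45.67),
     ('00004', 'Bruce', 'Wayne Enterprises', 'male', '2023-09-01', 123.12)])
conn.commit()

# The ? placeholder is filled in from params= (sqlite uses qmark style)
df = pd.read_sql(
    sql="SELECT * FROM users WHERE gender = ?",
    con=conn,
    params=('Female',))
print(df['fname'].tolist())

# Returns: ['Lois']
```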

In the following section, we’ll explore how to set an index column when reading a SQL table.

How to Set an Index Column When Reading SQL into a Pandas DataFrame

Pandas allows you to easily set the index of a DataFrame when reading a SQL query using the pd.read_sql() function. In order to do this, we can add the optional index_col= parameter and pass in the column that we want to use as our index column.

Let’s see how we can use the 'userid' as our index column:

# Setting an Index Column with Pandas read_sql
import pandas as pd
import sqlite3
conn = sqlite3.connect('users')

df = pd.read_sql(sql="SELECT * FROM users", con=conn, index_col='userid')
print(df.head())

# Returns:
#         fname            company  gender        date  amount
# userid                                                      
# 1         Nik             datagy    male  2023-06-01   12.34
# 2        Lois       Daily Planet  Female  2023-07-01   12.56
# 3       Peter        Parker Tech    Male  2023-08-01   45.67
# 4       Bruce  Wayne Enterprises    male  2023-09-01  123.12

In the code block above, we only added index_col='userid' into our function call. This returned the DataFrame where our column was correctly set as our index column.
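The index_col= parameter also accepts a list of column labels, in which case Pandas builds a MultiIndex from those columns. A small sketch, again rebuilding the users table in memory so it runs on its own:

```python
import sqlite3
import pandas as pd

# Rebuild the tutorial's users table in an in-memory database
conn = sqlite3.connect(':memory:')
conn.execute("""CREATE TABLE users(
   userid INT PRIMARY KEY, fname TEXT, company TEXT,
   gender TEXT, date DATE, amount FLOAT);""")
conn.executemany(
    "INSERT INTO users VALUES(?, ?, ?, ?, ?, ?);",
    [('00001', 'Nik', 'datagy', 'male', '2023-06-01', 12.34),
     ('00002', 'Lois', 'Daily Planet', 'Female', '2023-07-01', 12.56),
     ('00003', 'Peter', 'Parker Tech', 'Male', '2023-08-01', 45.67),
     ('00004', 'Bruce', 'Wayne Enterprises', 'male', '2023-09-01', 123.12)])
conn.commit()

# Passing a list of columns creates a MultiIndex
df = pd.read_sql(
    sql="SELECT * FROM users",
    con=conn,
    index_col=['gender', 'userid'])
print(list(df.index.names))

# Returns: ['gender', 'userid']
```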

How to Parse Dates When Reading SQL into a Pandas DataFrame

Similar to setting an index column, Pandas can also parse dates. In order to parse a column (or columns) as dates when reading a SQL query using Pandas, you can use the parse_dates= parameter.

Let’s see how we can parse the 'date' column as a datetime data type:

# Parsing Dates When Reading a SQL Query in Pandas
import pandas as pd
import sqlite3
conn = sqlite3.connect('users')

df = pd.read_sql(sql="SELECT * FROM users", con=conn, parse_dates=['date'])
print(df.info())

# Returns:
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 4 entries, 0 to 3
# Data columns (total 6 columns):
#  #   Column   Non-Null Count  Dtype         
# ---  ------   --------------  -----         
#  0   userid   4 non-null      int64         
#  1   fname    4 non-null      object        
#  2   company  4 non-null      object        
#  3   gender   4 non-null      object        
#  4   date     4 non-null      datetime64[ns]
#  5   amount   4 non-null      float64       
# dtypes: datetime64[ns](1), float64(1), int64(1), object(3)
# memory usage: 320.0+ bytes

In the code block above, we added the parse_dates=['date'] argument to the function call. Note that we pass the column label in as a list of columns, even when there is only one. We then used the .info() method to explore the data types and confirm that the 'date' column was read in as a datetime correctly.
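The parse_dates= parameter also accepts a dictionary mapping a column label to an explicit format string, which is handy when dates are stored in a non-standard format. A sketch, with the users table rebuilt in memory so the example is self-contained:

```python
import sqlite3
import pandas as pd

# Rebuild the tutorial's users table in an in-memory database
conn = sqlite3.connect(':memory:')
conn.execute("""CREATE TABLE users(
   userid INT PRIMARY KEY, fname TEXT, company TEXT,
   gender TEXT, date DATE, amount FLOAT);""")
conn.executemany(
    "INSERT INTO users VALUES(?, ?, ?, ?, ?, ?);",
    [('00001', 'Nik', 'datagy', 'male', '2023-06-01', 12.34),
     ('00002', 'Lois', 'Daily Planet', 'Female', '2023-07-01', 12.56),
     ('00003', 'Peter', 'Parker Tech', 'Male', '2023-08-01', 45.67),
     ('00004', 'Bruce', 'Wayne Enterprises', 'male', '2023-09-01', 123.12)])
conn.commit()

# Map the column to the format string the dates are stored in
df = pd.read_sql(
    sql="SELECT * FROM users",
    con=conn,
    parse_dates={'date': '%Y-%m-%d'})
print(df['date'].dtype)

# Returns: datetime64[ns]
```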

How to Chunk SQL Queries to Improve Performance When Reading into Pandas

You’ll often be presented with lots of data when working with SQL databases. In order to improve the performance of your queries, you can chunk your queries to reduce how many records are read at a time.

In order to chunk your SQL queries with Pandas, you can pass in a record size in the chunksize= parameter. This returns a generator object, as shown below:

# Reading SQL Queries in Chunks
import pandas as pd
import sqlite3
conn = sqlite3.connect('users')

chunks = pd.read_sql(sql="SELECT * FROM users", con=conn, chunksize=2)
print(chunks)

# Returns:
# <generator object SQLiteDatabase._query_iterator at 0x7fa8b16cc970>

We can see that when using the chunksize= parameter, Pandas returns a generator object rather than a DataFrame. We can iterate over the resulting object using a Python for loop, then use the Pandas concat() function to combine the chunks into a single DataFrame.

# Reading SQL Queries in Chunks
import pandas as pd
import sqlite3
conn = sqlite3.connect('users')

df = pd.DataFrame()
for chunk in pd.read_sql(sql="SELECT * FROM users", con=conn, index_col='userid', chunksize=2):
    df = pd.concat([df, chunk])
print(df)

# Returns:
#         fname            company  gender        date  amount
# userid                                                      
# 1         Nik             datagy    male  2023-06-01   12.34
# 2        Lois       Daily Planet  Female  2023-07-01   12.56
# 3       Peter        Parker Tech    Male  2023-08-01   45.67
# 4       Bruce  Wayne Enterprises    male  2023-09-01  123.12

While our actual query was quite small, imagine working with datasets that have millions of records. Reading the data in chunks like this can significantly reduce how much memory your program needs to hold at any one time.
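Note that concatenating every chunk, as above, still ends up with the full dataset in memory. When you only need an aggregate, you can instead process each chunk as it arrives and keep a running result. A sketch that sums the amount column chunk by chunk, with the users table rebuilt in memory so the example is self-contained:

```python
import sqlite3
import pandas as pd

# Rebuild the tutorial's users table in an in-memory database
conn = sqlite3.connect(':memory:')
conn.execute("""CREATE TABLE users(
   userid INT PRIMARY KEY, fname TEXT, company TEXT,
   gender TEXT, date DATE, amount FLOAT);""")
conn.executemany(
    "INSERT INTO users VALUES(?, ?, ?, ?, ?, ?);",
    [('00001', 'Nik', 'datagy', 'male', '2023-06-01', 12.34),
     ('00002', 'Lois', 'Daily Planet', 'Female', '2023-07-01', 12.56),
     ('00003', 'Peter', 'Parker Tech', 'Male', '2023-08-01', 45.67),
     ('00004', 'Bruce', 'Wayne Enterprises', 'male', '2023-09-01', 123.12)])
conn.commit()

# Keep a running total instead of concatenating the chunks
total = 0.0
for chunk in pd.read_sql(sql="SELECT * FROM users", con=conn, chunksize=2):
    total += chunk['amount'].sum()
print(round(total, 2))

# Returns: 193.69
```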

Conclusion

In this tutorial, you learned how to use the Pandas read_sql() function to query data from a SQL database into a Pandas DataFrame. Given how ubiquitous SQL databases are in production environments, being able to incorporate them into Pandas can be a great skill. You learned about how Pandas offers three different functions to read SQL.

You first learned how to understand the different parameters of the function. Then, you walked through step-by-step examples, including reading a simple query, setting index columns, and parsing dates. We closed off the tutorial by chunking our queries to improve performance.

Nik Piepenbreier

Nik is the author of datagy.io and has over a decade of experience working with data analytics, data science, and Python. He specializes in teaching developers how to use Python for data science using hands-on tutorials.
