In this tutorial, you’ll learn how to read SQL tables or queries into a Pandas DataFrame. Given how prevalent SQL is in industry, knowing how to pull that data into Pandas is an essential skill.
By the end of this tutorial, you’ll have learned the following:
- How to use the pd.read_sql() function (and the other Pandas functions for reading SQL)
- How to read a SQL table or query into a Pandas DataFrame
- How to customize the function’s behavior to set index columns, parse dates, and improve performance by reading the data in chunks
Understanding Functions to Read SQL into Pandas DataFrames
Pandas provides three different functions to read SQL into a DataFrame:
- pd.read_sql() – which is a convenience wrapper for the two functions below
- pd.read_sql_table() – which reads a table in a SQL database into a DataFrame
- pd.read_sql_query() – which reads a SQL query into a DataFrame
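To make the distinction concrete, here is a minimal sketch of all three functions run against the sample database we build later in this tutorial. Note that pd.read_sql_table() requires a SQLAlchemy connectable rather than a raw DBAPI connection, so this sketch assumes SQLAlchemy is installed and that the SQLite file is named users in the working directory:
# Comparing the Three Pandas Functions for Reading SQL
import pandas as pd
from sqlalchemy import create_engine
# read_sql_table() only works with a SQLAlchemy connectable
engine = create_engine("sqlite:///users")
# Read an entire table by name
df_table = pd.read_sql_table("users", con=engine)
# Read the result of a SQL query
df_query = pd.read_sql_query("SELECT fname, amount FROM users", con=engine)
# read_sql() dispatches to one of the two functions above
df_from_table = pd.read_sql("users", con=engine)                # treated as a table name
df_from_query = pd.read_sql("SELECT * FROM users", con=engine)  # treated as a query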
Due to its versatility, we’ll focus our attention on the pd.read_sql()
function, which can be used to read both tables and queries.
Let’s take a look at the function’s parameters and default arguments:
# Understanding the Pandas read_sql() Function
import pandas as pd
pd.read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=None)
We can see that we need to provide two arguments:
- sql= is used to provide either a SQL table name or a SQL query
- con= provides a connection to the SQL database
Let’s start off learning how to use the function by first loading a sample sqlite database.
Loading a Sample SQL Database
In the code block below, we provide code for creating a custom SQL database. If, instead, you’re working with your own database, feel free to use that, though your results will of course vary. We’re using sqlite here to simplify creating the database:
import sqlite3
# Connect to Database
conn = sqlite3.connect('users')
cur = conn.cursor()
# Create Table
cur.execute("""CREATE TABLE IF NOT EXISTS users(
userid INT PRIMARY KEY,
fname TEXT,
company TEXT,
gender TEXT,
date DATE,
amount FLOAT);
""")
conn.commit()
# Add User Information
users = [
('00001', 'Nik', 'datagy', 'male', '2023-06-01', 12.34),
('00002', 'Lois', 'Daily Planet', 'Female', '2023-07-01', 12.56),
('00003', 'Peter', 'Parker Tech', 'Male', '2023-08-01', 45.67),
('00004', 'Bruce', 'Wayne Enterprises', 'male', '2023-09-01', 123.12)
]
cur.executemany("INSERT INTO users VALUES(?, ?, ?, ?, ?, ?);", users)
conn.commit()
In the code block above, we added four records to our users table. Let’s now see how we can load data from our SQL database into Pandas.
How to Read SQL into a Pandas DataFrame
In order to read a SQL table or query into a Pandas DataFrame, you can use the pd.read_sql() function.
The function depends on you having a declared connection to a SQL database. While we won’t go into how to connect to every database, we’ll continue to follow along with our sqlite example. Because our sample database isn’t password-protected, we can simply declare a connection variable using conn = sqlite3.connect('users').
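As a quick aside for readers on other databases: the same pd.read_sql() call works with a SQLAlchemy engine. The connection string below is a hypothetical PostgreSQL example, purely for illustration; substitute your own driver, host, credentials, and database name:
# Connecting to a Different Database with SQLAlchemy (illustrative only)
import pandas as pd
from sqlalchemy import create_engine
# Hypothetical connection string -- replace with your own details
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/mydb")
df = pd.read_sql(sql="SELECT * FROM users", con=engine)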
Let’s take a look at how we can query all records from a table into a DataFrame:
# Reading a SQL Table Into Pandas DataFrame
import pandas as pd
import sqlite3
conn = sqlite3.connect('users')
df = pd.read_sql(sql="SELECT * FROM users", con=conn)
print(df.head())
# Returns:
# userid fname company gender date amount
# 0 1 Nik datagy male 2023-06-01 12.34
# 1 2 Lois Daily Planet Female 2023-07-01 12.56
# 2 3 Peter Parker Tech Male 2023-08-01 45.67
# 3 4 Bruce Wayne Enterprises male 2023-09-01 123.12
In the code block above, we loaded a Pandas DataFrame using the pd.read_sql() function. The function only has two required parameters:
- The SQL query, passed into the sql= parameter, and
- The connection to the database, passed into the con= parameter
In the code block, we connected to our SQL database using sqlite. Then, we asked Pandas to query the entirety of the users
table. This returned the table shown above.
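The function also accepts a params= argument (visible in the signature we looked at earlier), which lets you pass values into a parameterized query rather than formatting them into the SQL string yourself. Here is a small sketch against our sample table, using sqlite’s ? placeholder style (other drivers may use a different placeholder):
# Passing Parameters to a SQL Query in Pandas
import pandas as pd
import sqlite3
conn = sqlite3.connect('users')
# sqlite3 uses "?" placeholders; params supplies the values safely
df = pd.read_sql(sql="SELECT * FROM users WHERE amount > ?", con=conn, params=(20.0,))
print(df)
# Returns only the rows where amount is greater than 20 (Peter and Bruce)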
In the following section, we’ll explore how to set an index column when reading a SQL table.
How to Set an Index Column When Reading SQL into a Pandas DataFrame
Pandas allows you to easily set the index of a DataFrame when reading a SQL query using the pd.read_sql()
function. In order to do this, we can add the optional index_col=
parameter and pass in the column that we want to use as our index column.
Let’s see how we can use the 'userid'
as our index column:
# Setting an Index Column with Pandas read_sql
import pandas as pd
import sqlite3
conn = sqlite3.connect('users')
df = pd.read_sql(sql="SELECT * FROM users", con=conn, index_col='userid')
print(df.head())
# Returns:
# fname company gender date amount
# userid
# 1 Nik datagy male 2023-06-01 12.34
# 2 Lois Daily Planet Female 2023-07-01 12.56
# 3 Peter Parker Tech Male 2023-08-01 45.67
# 4 Bruce Wayne Enterprises male 2023-09-01 123.12
In the code block above, we only added index_col='userid'
into our function call. This returned the DataFrame where our column was correctly set as our index column.
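The index_col= parameter will also accept a list of column names if you’d like a MultiIndex. A brief sketch, purely for illustration, indexing on both userid and fname:
# Setting a MultiIndex When Reading SQL into Pandas
import pandas as pd
import sqlite3
conn = sqlite3.connect('users')
df = pd.read_sql(sql="SELECT * FROM users", con=conn, index_col=['userid', 'fname'])
# The resulting DataFrame has a two-level index
print(df.index.nlevels)
# Returns:
# 2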
How to Parse Dates When Reading SQL into a Pandas DataFrame
Similar to setting an index column, Pandas can also parse dates. In order to parse a column (or columns) as dates when reading a SQL query using Pandas, you can use the parse_dates=
parameter.
Let’s see how we can parse the 'date'
column as a datetime data type:
# Parsing Dates When Reading a SQL Query in Pandas
import pandas as pd
import sqlite3
conn = sqlite3.connect('users')
df = pd.read_sql(sql="SELECT * FROM users", con=conn, parse_dates=['date'])
print(df.info())
# Returns:
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 4 entries, 0 to 3
# Data columns (total 6 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 userid 4 non-null int64
# 1 fname 4 non-null object
# 2 company 4 non-null object
# 3 gender 4 non-null object
# 4 date 4 non-null datetime64[ns]
# 5 amount 4 non-null float64
# dtypes: datetime64[ns](1), float64(1), int64(1), object(3)
# memory usage: 320.0+ bytes
In the code block above, we added the parse_dates=['date'] argument into the function call. Note that we’re passing the column label in as a list of columns, even when there is only one. We then used the .info() method to explore the data types and confirm that the column was correctly read in as a datetime.
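If your stored values need an explicit format, parse_dates= also accepts a dictionary mapping each column name to a strftime-style format string (or to keyword arguments for pd.to_datetime()). A short sketch using an explicit format for our date column:
# Parsing Dates with an Explicit Format String
import pandas as pd
import sqlite3
conn = sqlite3.connect('users')
df = pd.read_sql(sql="SELECT * FROM users", con=conn, parse_dates={'date': '%Y-%m-%d'})
print(df['date'].dtype)
# Returns:
# datetime64[ns]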
How to Chunk SQL Queries to Improve Performance When Reading into Pandas
You’ll often be presented with lots of data when working with SQL databases. To keep your workload manageable, you can chunk your queries to limit how many records are read into memory at a time.
In order to chunk your SQL queries with Pandas, you can pass in a record size in the chunksize=
parameter. This returns a generator object, as shown below:
# Reading SQL Queries in Chunks
import pandas as pd
import sqlite3
conn = sqlite3.connect('users')
df = pd.read_sql(sql="SELECT * FROM users", con=conn, chunksize=2)
print(df)
# Returns:
# <generator object SQLiteDatabase._query_iterator at 0x7fa8b16cc970>
We can see that when using the chunksize= parameter, Pandas returns a generator object. We can iterate over the resulting object using a Python for loop. We then use the Pandas concat() function to combine the chunks into a single DataFrame.
# Reading SQL Queries in Chunks
import pandas as pd
import sqlite3
conn = sqlite3.connect('users')
chunks = []
for chunk in pd.read_sql(sql="SELECT * FROM users", con=conn, index_col='userid', chunksize=2):
    chunks.append(chunk)
df = pd.concat(chunks)
print(df)
# Returns:
# fname company gender date amount
# userid
# 1 Nik datagy male 2023-06-01 12.34
# 2 Lois Daily Planet Female 2023-07-01 12.56
# 3 Peter Parker Tech Male 2023-08-01 45.67
# 4 Bruce Wayne Enterprises male 2023-09-01 123.12
While our actual query was quite small, imagine working with datasets that have millions of records. Being able to read the results in smaller chunks keeps your program’s memory footprint manageable.
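Keep in mind that concatenating every chunk back together, as we did above, still ends up holding the full dataset in memory. The bigger win comes from processing each chunk as you go. Here is a small sketch of that pattern, aggregating the amount column one chunk at a time:
# Processing Each Chunk Without Concatenating Them All
import pandas as pd
import sqlite3
conn = sqlite3.connect('users')
total = 0.0
for chunk in pd.read_sql(sql="SELECT * FROM users", con=conn, chunksize=2):
    # Each chunk is a small DataFrame; aggregate it and move on
    total += chunk['amount'].sum()
print(round(total, 2))
# Returns:
# 193.69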
Conclusion
In this tutorial, you learned how to use the Pandas read_sql()
function to query data from a SQL database into a Pandas DataFrame. Given how ubiquitous SQL databases are in production environments, being able to incorporate them into Pandas can be a great skill. You learned that Pandas offers three different functions to read SQL.
You first learned how to understand the different parameters of the function. Then, you walked through step-by-step examples, including reading a simple query, setting index columns, and parsing dates. We closed off the tutorial by chunking our queries to improve performance.
Additional Resources
To learn more about related topics, check out the resources below: