Pandas read_sql: Reading SQL into DataFrames

In this tutorial, you’ll learn how to read SQL tables or queries into a Pandas DataFrame. Given how prevalent SQL is in industry, it’s important to understand how to read SQL into a Pandas DataFrame.

By the end of this tutorial, you’ll have learned the following:

  • How to use the pd.read_sql() function (and the other Pandas functions for reading SQL)
  • How to read a SQL table or query into a Pandas DataFrame
  • How to customize the function’s behavior to set index columns, parse dates, and improve performance by reading the data in chunks

Understanding Functions to Read SQL into Pandas DataFrames

Pandas provides three different functions to read SQL into a DataFrame:

  1. pd.read_sql() – which is a convenience wrapper for the two functions below
  2. pd.read_sql_table() – which reads a table in a SQL database into a DataFrame
  3. pd.read_sql_query() – which reads a SQL query into a DataFrame

Due to its versatility, we’ll focus our attention on the pd.read_sql() function, which can be used to read both tables and queries.

Let’s take a look at the function’s parameters and default arguments:

# Understanding the Pandas read_sql() Function
import pandas as pd
pd.read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=None)

We can see that we need to provide two arguments:

  1. sql= is used to provide either a SQL table or a SQL query
  2. con= provides a connection to the SQL database

Let’s start off learning how to use the function by first loading a sample sqlite database.

Loading a Sample SQL Database

In the code block below, we provide code for creating a custom SQL database. If, instead, you’re working with your own database, feel free to use that, though your results will of course vary. We’re using sqlite here to simplify creating the database:

import sqlite3

# Connect to Database
conn = sqlite3.connect('users') 
cur = conn.cursor()

# Create Table
cur.execute("""CREATE TABLE IF NOT EXISTS users(
   userid INT PRIMARY KEY,
   fname TEXT,
   company TEXT,
   gender TEXT,
   date DATE,
   amount FLOAT);
""")
conn.commit()

# Add User Information
users = [
    ('00001', 'Nik', 'datagy', 'male', '2023-06-01', 12.34), 
    ('00002', 'Lois', 'Daily Planet', 'Female', '2023-07-01', 12.56), 
    ('00003', 'Peter', 'Parker Tech', 'Male', '2023-08-01', 45.67), 
    ('00004', 'Bruce', 'Wayne Enterprises', 'male', '2023-09-01', 123.12)
    ]

cur.executemany("INSERT INTO users VALUES(?, ?, ?, ?, ?, ?);", users)
conn.commit()

In the code block above, we added four records to our database users. Let’s now see how we can load data from our SQL database in Pandas.
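Before moving on to Pandas, it can be worth a quick sanity check that the records were actually written. The sketch below rebuilds the same table in an in-memory database (rather than the 'users' file used above) so that it runs on its own, then counts the rows with a plain cursor query:

```python
import sqlite3

# Rebuild the tutorial's users table in an in-memory database
conn = sqlite3.connect(':memory:')
conn.execute("""CREATE TABLE users(
   userid INT PRIMARY KEY,
   fname TEXT,
   company TEXT,
   gender TEXT,
   date DATE,
   amount FLOAT);""")
conn.executemany(
    "INSERT INTO users VALUES(?, ?, ?, ?, ?, ?);",
    [('00001', 'Nik', 'datagy', 'male', '2023-06-01', 12.34),
     ('00002', 'Lois', 'Daily Planet', 'Female', '2023-07-01', 12.56),
     ('00003', 'Peter', 'Parker Tech', 'Male', '2023-08-01', 45.67),
     ('00004', 'Bruce', 'Wayne Enterprises', 'male', '2023-09-01', 123.12)])
conn.commit()

# Count the rows we just inserted
count = conn.execute("SELECT COUNT(*) FROM users;").fetchone()[0]
print(count)

# Returns: 4
```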

How to Read SQL into a Pandas DataFrame

In order to read a SQL table or query into a Pandas DataFrame, you can use the pd.read_sql() function.

The function depends on you having a declared connection to a SQL database. While we won’t go into how to connect to every database, we’ll continue to follow along with our sqlite example. In order to connect to the unprotected database, we can simply declare a connection variable using conn = sqlite3.connect('users').

Let’s take a look at how we can query all records from a table into a DataFrame:

# Reading a SQL Table Into Pandas DataFrame
import pandas as pd
import sqlite3
conn = sqlite3.connect('users')

df = pd.read_sql(sql="SELECT * FROM users", con=conn)
print(df.head())

# Returns:
#    userid  fname            company  gender        date  amount
# 0       1    Nik             datagy    male  2023-06-01   12.34
# 1       2   Lois       Daily Planet  Female  2023-07-01   12.56
# 2       3  Peter        Parker Tech    Male  2023-08-01   45.67
# 3       4  Bruce  Wayne Enterprises    male  2023-09-01  123.12

In the code block above, we loaded a Pandas DataFrame using the pd.read_sql() function. The function only has two required parameters:

  1. The SQL query, passed into the sql= parameter, and
  2. The connection to the database, passed into the con= parameter

In the code block, we connected to our SQL database using sqlite. Then, we asked Pandas to query the entirety of the users table. This returned the table shown above.
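As an aside, the params= parameter from the function signature above lets you pass values into a parameterized query safely, rather than formatting them into the SQL string yourself (which risks SQL injection). A minimal sketch, rebuilding the users table in memory so the example is self-contained:

```python
import sqlite3
import pandas as pd

# Rebuild the tutorial's users table in an in-memory database
conn = sqlite3.connect(':memory:')
conn.execute("""CREATE TABLE users(
   userid INT PRIMARY KEY, fname TEXT, company TEXT,
   gender TEXT, date DATE, amount FLOAT);""")
conn.executemany(
    "INSERT INTO users VALUES(?, ?, ?, ?, ?, ?);",
    [('00001', 'Nik', 'datagy', 'male', '2023-06-01', 12.34),
     ('00002', 'Lois', 'Daily Planet', 'Female', '2023-07-01', 12.56),
     ('00003', 'Peter', 'Parker Tech', 'Male', '2023-08-01', 45.67),
     ('00004', 'Bruce', 'Wayne Enterprises', 'male', '2023-09-01', 123.12)])
conn.commit()

# The ? placeholder is filled in from params= (sqlite uses qmark style)
df = pd.read_sql(
    sql="SELECT * FROM users WHERE gender = ?",
    con=conn,
    params=('Female',))
print(df['fname'].tolist())

# Returns: ['Lois']
```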

In the following section, we’ll explore how to set an index column when reading a SQL table.

How to Set an Index Column When Reading SQL into a Pandas DataFrame

Pandas allows you to easily set the index of a DataFrame when reading a SQL query using the pd.read_sql() function. In order to do this, we can add the optional index_col= parameter and pass in the column that we want to use as our index column.

Let’s see how we can use the 'userid' as our index column:

# Setting an Index Column with Pandas read_sql
import pandas as pd
import sqlite3
conn = sqlite3.connect('users')

df = pd.read_sql(sql="SELECT * FROM users", con=conn, index_col='userid')
print(df.head())

# Returns:
#         fname            company  gender        date  amount
# userid                                                      
# 1         Nik             datagy    male  2023-06-01   12.34
# 2        Lois       Daily Planet  Female  2023-07-01   12.56
# 3       Peter        Parker Tech    Male  2023-08-01   45.67
# 4       Bruce  Wayne Enterprises    male  2023-09-01  123.12

In the code block above, we only added index_col='userid' into our function call. This returned the DataFrame where our column was correctly set as our index column.
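The index_col= parameter also accepts a list of column labels, in which case Pandas builds a MultiIndex from those columns. A small sketch, again rebuilding the users table in memory so it runs on its own:

```python
import sqlite3
import pandas as pd

# Rebuild the tutorial's users table in an in-memory database
conn = sqlite3.connect(':memory:')
conn.execute("""CREATE TABLE users(
   userid INT PRIMARY KEY, fname TEXT, company TEXT,
   gender TEXT, date DATE, amount FLOAT);""")
conn.executemany(
    "INSERT INTO users VALUES(?, ?, ?, ?, ?, ?);",
    [('00001', 'Nik', 'datagy', 'male', '2023-06-01', 12.34),
     ('00002', 'Lois', 'Daily Planet', 'Female', '2023-07-01', 12.56),
     ('00003', 'Peter', 'Parker Tech', 'Male', '2023-08-01', 45.67),
     ('00004', 'Bruce', 'Wayne Enterprises', 'male', '2023-09-01', 123.12)])
conn.commit()

# Passing a list of columns creates a MultiIndex
df = pd.read_sql(
    sql="SELECT * FROM users",
    con=conn,
    index_col=['gender', 'userid'])
print(list(df.index.names))

# Returns: ['gender', 'userid']
```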

How to Parse Dates When Reading SQL into a Pandas DataFrame

Similar to setting an index column, Pandas can also parse dates. In order to parse a column (or columns) as dates when reading a SQL query using Pandas, you can use the parse_dates= parameter.

Let’s see how we can parse the 'date' column as a datetime data type:

# Parsing Dates When Reading a SQL Query in Pandas
import pandas as pd
import sqlite3
conn = sqlite3.connect('users')

df = pd.read_sql(sql="SELECT * FROM users", con=conn, parse_dates=['date'])
print(df.info())

# Returns:
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 4 entries, 0 to 3
# Data columns (total 6 columns):
#  #   Column   Non-Null Count  Dtype         
# ---  ------   --------------  -----         
#  0   userid   4 non-null      int64         
#  1   fname    4 non-null      object        
#  2   company  4 non-null      object        
#  3   gender   4 non-null      object        
#  4   date     4 non-null      datetime64[ns]
#  5   amount   4 non-null      float64       
# dtypes: datetime64[ns](1), float64(1), int64(1), object(3)
# memory usage: 320.0+ bytes

In the code block above, we added the parse_dates=['date'] argument to the function call. Note that we pass the column label in as a list of columns, even when there is only one. We then used the .info() method to explore the data types and confirm that the 'date' column was read in as a datetime correctly.
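The parse_dates= parameter also accepts a dictionary mapping a column label to an explicit format string, which is handy when dates are stored in a non-standard format. A sketch, with the users table rebuilt in memory so the example is self-contained:

```python
import sqlite3
import pandas as pd

# Rebuild the tutorial's users table in an in-memory database
conn = sqlite3.connect(':memory:')
conn.execute("""CREATE TABLE users(
   userid INT PRIMARY KEY, fname TEXT, company TEXT,
   gender TEXT, date DATE, amount FLOAT);""")
conn.executemany(
    "INSERT INTO users VALUES(?, ?, ?, ?, ?, ?);",
    [('00001', 'Nik', 'datagy', 'male', '2023-06-01', 12.34),
     ('00002', 'Lois', 'Daily Planet', 'Female', '2023-07-01', 12.56),
     ('00003', 'Peter', 'Parker Tech', 'Male', '2023-08-01', 45.67),
     ('00004', 'Bruce', 'Wayne Enterprises', 'male', '2023-09-01', 123.12)])
conn.commit()

# Map the column to the format string the dates are stored in
df = pd.read_sql(
    sql="SELECT * FROM users",
    con=conn,
    parse_dates={'date': '%Y-%m-%d'})
print(df['date'].dtype)

# Returns: datetime64[ns]
```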

How to Chunk SQL Queries to Improve Performance When Reading into Pandas

You’ll often be presented with lots of data when working with SQL databases. In order to improve the performance of your queries, you can chunk your queries to reduce how many records are read at a time.

In order to chunk your SQL queries with Pandas, you can pass in a record size in the chunksize= parameter. This returns a generator object, as shown below:

# Reading SQL Queries in Chunks
import pandas as pd
import sqlite3
conn = sqlite3.connect('users')

chunks = pd.read_sql(sql="SELECT * FROM users", con=conn, chunksize=2)
print(chunks)

# Returns:
# <generator object SQLiteDatabase._query_iterator at 0x7fa8b16cc970>

We can see that when using the chunksize= parameter, Pandas returns a generator object rather than a DataFrame. We can iterate over the resulting object using a Python for loop, then use the Pandas concat() function to combine the chunks into a single DataFrame.

# Reading SQL Queries in Chunks
import pandas as pd
import sqlite3
conn = sqlite3.connect('users')

df = pd.DataFrame()
for chunk in pd.read_sql(sql="SELECT * FROM users", con=conn, index_col='userid', chunksize=2):
    df = pd.concat([df, chunk])
print(df)

# Returns:
#         fname            company  gender        date  amount
# userid                                                      
# 1         Nik             datagy    male  2023-06-01   12.34
# 2        Lois       Daily Planet  Female  2023-07-01   12.56
# 3       Peter        Parker Tech    Male  2023-08-01   45.67
# 4       Bruce  Wayne Enterprises    male  2023-09-01  123.12

While our actual query was quite small, imagine working with datasets that have millions of records. Reading the data in chunks like this can significantly reduce how much memory your program needs to hold at any one time.
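Note that concatenating every chunk, as above, still ends up with the full dataset in memory. When you only need an aggregate, you can instead process each chunk as it arrives and keep a running result. A sketch that sums the amount column chunk by chunk, with the users table rebuilt in memory so the example is self-contained:

```python
import sqlite3
import pandas as pd

# Rebuild the tutorial's users table in an in-memory database
conn = sqlite3.connect(':memory:')
conn.execute("""CREATE TABLE users(
   userid INT PRIMARY KEY, fname TEXT, company TEXT,
   gender TEXT, date DATE, amount FLOAT);""")
conn.executemany(
    "INSERT INTO users VALUES(?, ?, ?, ?, ?, ?);",
    [('00001', 'Nik', 'datagy', 'male', '2023-06-01', 12.34),
     ('00002', 'Lois', 'Daily Planet', 'Female', '2023-07-01', 12.56),
     ('00003', 'Peter', 'Parker Tech', 'Male', '2023-08-01', 45.67),
     ('00004', 'Bruce', 'Wayne Enterprises', 'male', '2023-09-01', 123.12)])
conn.commit()

# Keep a running total instead of concatenating the chunks
total = 0.0
for chunk in pd.read_sql(sql="SELECT * FROM users", con=conn, chunksize=2):
    total += chunk['amount'].sum()
print(round(total, 2))

# Returns: 193.69
```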

Conclusion

In this tutorial, you learned how to use the Pandas read_sql() function to query data from a SQL database into a Pandas DataFrame. Given how ubiquitous SQL databases are in production environments, being able to incorporate them into Pandas can be a great skill. You learned about how Pandas offers three different functions to read SQL.

You first learned how to understand the different parameters of the function. Then, you walked through step-by-step examples, including reading a simple query, setting index columns, and parsing dates. We closed off the tutorial by chunking our queries to improve performance.

Nik Piepenbreier

Nik is the author of datagy.io and has over a decade of experience working with data analytics, data science, and Python. He specializes in teaching developers how to use Python for data science using hands-on tutorials.
