
Use Pandas & Python to Extract Tables from Webpages (read_html)

You may find yourself needing to extract tables from a webpage to gather data, and Python is a natural tool for the job. Perhaps you’ve heard of libraries like Beautiful Soup. But when the data is already structured in HTML tables, Pandas can fetch and parse it for you as well. In this post, we’ll explore how to scrape web tables with Python and turn them into functional dataframes!

How To Scrape Web Tables with Python

In order to easily extract tables from a webpage with Python, we’ll need to use Pandas. If you haven’t already done so, install Pandas with either pip or conda.

pip install pandas #or
conda install pandas

From there, we can import the library using:

import pandas as pd

For this example, we’ll want to scrape the data tables available on the World Population Wikipedia article. There are plenty of tables available on the page. Here we show just a few, but take a moment to explore the different tables that are available:

A picture showing a number of tables available on the web page. The article will explore how to easily scrape the tables using Python.

In other posts, like this one on un-pivoting data, we explored how to load data into a Pandas dataframe. We’ll take a slightly different approach this time and use the pd.read_html function:

dfs = pd.read_html('http://en.wikipedia.org/wiki/World_population')

You may notice two things here:

  1. We named our variable dfs, as this function generates a list of all the dataframes it pulls, and
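  2. We removed the s from https, because the lxml parser that read_html uses by default only accepts the http, ftp, and file URL protocols (newer versions of Pandas generally handle https URLs as well).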
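
To see the list behaviour without hitting the network, here is a minimal sketch that parses a small, made-up HTML snippet instead of the Wikipedia page. The html string and the variable names are assumptions for illustration; the cell values are taken from the continent table shown later in this post. Note that read_html needs a parser library such as lxml (or Beautiful Soup with html5lib) installed alongside Pandas.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for a downloaded page: two small HTML tables.
html = """
<table>
  <tr><th>Continent</th><th>Density</th></tr>
  <tr><td>Asia</td><td>96.4</td></tr>
  <tr><td>Africa</td><td>36.7</td></tr>
</table>
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Sydney</td><td>5005400</td></tr>
</table>
"""

# read_html always returns a list, with one dataframe per <table> it finds.
dfs = pd.read_html(StringIO(html))
print(type(dfs), len(dfs))

# The match argument keeps only tables whose text matches a string or regex.
city_tables = pd.read_html(StringIO(html), match="Sydney")
print(city_tables[0])
```

On a page with dozens of tables, match can be a convenient way to narrow the list down before picking a table by position.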

It may not be immediately obvious which table ends up at which position in the list, but tables are read in the order in which they appear in the site’s HTML code. You can view that code by right-clicking the page and selecting View Page Source (the exact wording varies by browser and operating system):

Selecting page source in Firefox
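
As an alternative to reading the page source, one could list each table's position, shape, and column names from Python. The sketch below assumes a hypothetical helper named summarize_tables and uses two tiny hand-built dataframes as a stand-in for the list that read_html returns.

```python
import pandas as pd


def summarize_tables(tables):
    """Return (index, shape, column names) for each dataframe in a list."""
    return [(i, df.shape, list(df.columns)) for i, df in enumerate(tables)]


# Stand-in for the list returned by pd.read_html; the real list is much longer.
dfs = [
    pd.DataFrame({"Continent": ["Asia", "Africa", "Europe"],
                  "Density": [96.4, 36.7, 72.9]}),
    pd.DataFrame({"Continent": ["Africa", "Oceania"],
                  "Most populous city": ["Cairo", "Sydney"]}),
]

summary = summarize_tables(dfs)
for index, shape, columns in summary:
    print(index, shape, columns)
```

Scanning this summary is usually quicker than counting table tags by hand.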

We can also explore the different dataframes directly in Python, in the same way that we would access a list item. If we wanted to print out the third dataframe, we could write:

print(dfs[2])

# Returns:
# Continent Density(inhabitants/km2)  ...                              Most populous country             Most populous city (metropolitan area)
# 0                   Asia                     96.4  ...                      1,382,300,000[note 1] – China  35,676,000/13,634,685 – Greater Tokyo Area/Tok...
# 1                 Africa                     36.7  ...                             0186,987,000 – Nigeria                            20,500,000 – Cairo [17]
# 2                 Europe                     72.9  ...  0145,939,000 – Russia;approx. 112 million in E...  16,855,000/12,506,468 – Moscow metropolitan ar...
# 3  North America[note 2]                     22.9  ...                       0324,991,600 – United States  23,723,696/8,537,673 – New York Metropolitan A...
# 4          South America                     22.8  ...                              0209,567,000 – Brazil  27,640,577/11,316,149 – Metro Area/São Paulo City
# 5                Oceania                      4.5  ...                           0024,458,800 – Australia                                 5,005,400 – Sydney
# 6             Antarctica           0.0003(varies)  ...                                        N/A[note 3]    1,200 (non-permanent, varies) – McMurdo Station
# [7 rows x 5 columns]

If we now want to assign this table to its own variable, we can give it a meaningful name by writing:

pop_by_continent = dfs[2]

We can then use helpful Pandas methods such as .head() or .describe().
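
A minimal sketch of those two methods follows. It uses a small hand-built dataframe as a stand-in for the scraped table, with a few values copied from the printed output above. The cleanup step is an assumption about what a scraped column may need, not part of read_html itself: cells such as 0.0003(varies) arrive as text, so the numeric part is extracted before describing the column.

```python
import pandas as pd

# A few values copied from the printed output; a stand-in for the scraped table.
pop_by_continent = pd.DataFrame({
    "Continent": ["Asia", "Africa", "Europe", "Antarctica"],
    "Density": ["96.4", "36.7", "72.9", "0.0003(varies)"],
})

# Keep the leading number in each cell and convert the column to floats.
pop_by_continent["Density"] = (
    pop_by_continent["Density"].str.extract(r"([\d.]+)", expand=False).astype(float)
)

# head(n) shows the first n rows; describe() summarizes the numeric columns.
print(pop_by_continent.head(3))
print(pop_by_continent.describe())
```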

Conclusion: Use Python to Extract Tables from Webpages

In this post, we explored how to easily scrape web tables with Python, using the always-powerful Pandas. To learn more about the functions available in Pandas, check out its official documentation.

Nik Piepenbreier

Nik is the author of datagy.io and has over a decade of experience working with data analytics, data science, and Python. He specializes in teaching developers how to use Python for data science using hands-on tutorials.