In this tutorial, you’ll learn how to dive into the wonderful world of Pandas. Pandas is a Python package that provides fast and flexible data structures used for data manipulation and analysis. By the end of this tutorial, you’ll have learned how to:
- Install pandas for Python using pip or conda
- Understand the pandas series and dataframe objects
- Create a dataframe from scratch, and
- Import a dataframe from a .csv file or an .xls file
What is Pandas for Python?
Pandas is a Python package that allows you to work with tabular data and provides many helpful methods and functions to help you manipulate and analyze your data. Pandas is incredibly well suited to working with many different types of data (such as strings, integers, dates and times) in a tabular format.
It can read data from many different sources, such as the internet, Excel files, and SQL databases. Data can be ordered or unordered and can be stored using different data types. The library also handles missing data incredibly well and allows you to update, insert, and delete data using vectorized operations.
Data are stored in two main data structures, Pandas series and Pandas dataframe objects. These structures can easily be sorted, filtered, and merged with other data.
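As a rough preview of what sorting, filtering, and merging look like once pandas is installed, here is a minimal sketch. The names, ages, and cities are made-up sample values, and the variable names are only illustrative:

```python
import pandas as pd

# Hypothetical sample data used only for illustration
people = pd.DataFrame({'Name': ['Nik', 'Kate', 'Jane'], 'Age': [33, 32, 20]})
cities = pd.DataFrame({'Name': ['Nik', 'Jane'], 'City': ['Toronto', 'Montreal']})

# Sort the rows by the Age column, smallest first
sorted_people = people.sort_values('Age')

# Keep only the rows where Age is greater than 30
over_30 = people[people['Age'] > 30]

# Combine the two tables on their shared Name column (inner join by default)
merged = people.merge(cities, on='Name')

print(sorted_people)
print(over_30)
print(merged)
```

Each of these operations returns a new DataFrame and leaves the original people table unchanged.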
All that to say, Pandas is an incredible addition to the Python environment. It’s supported in a wide spectrum of domains (including finance, data science, and science) and is recognized as one of the pillar libraries of Python. Because of its ease of use and wide applications for data science, you’ll learn about Pandas over the next few sections!
How Do You Install Pandas in Python?
pandas is not part of the standard Python library, so you’ll need to install it first. Thankfully, this is quite easy using either pip or conda. Run one of these lines in your terminal or command line, depending on how you prefer to manage packages:
pip install pandas
# conda install pandas
As you run through the installation, you’ll notice that a lot of dependencies will also be installed. Once you have successfully installed pandas, you can import the library as you normally would import a package. Conventionally, pandas is imported with the alias of pd:
# Importing pandas
import pandas as pd
If this code runs without any errors being raised, then your installation of pandas was successful! In the following section, you’ll learn about the Pandas series data structure.
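If you also want to confirm which version was installed, one option is to print the version string that the package exposes. This is a small sketch, and the value you see depends on your own installation:

```python
import pandas as pd

# pd.__version__ is a string; its exact value depends on your installation
installed_version = pd.__version__
print(installed_version)
```

Knowing the version is useful because some default behaviours differ between pandas releases.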
What is a Pandas Series?
The Pandas series object is a one-dimensional data structure. You can think of this as being comparable to a column in a table. Pandas describes this as a one-dimensional homogeneously-typed array. This means that the data are aligned along a single axis and are of the same data type.
A Pandas series is an object that contains data, is indexed, and has a particular data type. Let’s see how we can create a Pandas series:
# Creating a Pandas series
import pandas as pd
sample_series = pd.Series()
print(sample_series)
# Returns: Series([], dtype: object)
# (pandas versions before 2.0 show dtype: float64 here)
We can see that the series contains a list-like structure. We can see what this looks like in more detail by creating a series that contains data:
# Creating a Pandas series with data
sample_series = pd.Series(['Nik', 'Kate', 'Jane', 'Jim'])
print(sample_series)
# Returns:
# 0 Nik
# 1 Kate
# 2 Jane
# 3 Jim
# dtype: object
We can see here that our series contains the data that we passed into it. In this case, we passed in a list and the data was converted into a series object. On the left, we see our index. By default, Pandas will use an integer index that runs from 0 through the end of the series. Let’s see how we can pass in a series index as well:
# Creating a Pandas Series with an Index
sample_series = pd.Series(data = [33, 32, 40, 20], index=['Nik', 'Kate', 'Jane', 'Jim'])
print(sample_series)
# Returns:
# Nik 33
# Kate 32
# Jane 40
# Jim 20
# dtype: int64
In the example above, we created a Pandas series by:
- Calling the Series constructor
- Passing in the argument of data, which allows us, in this case, to pass in a list of data
- Passing in the index argument, which allows us to pass in a list of indices
We can see that this is much, much different than creating a normal Python list. We can see that each item in our series is accessible now by its location (similar to a Python list), but also by its named index. For example, we can now access a person’s age by simply indexing the name provided in the index:
# Accessing a Pandas series item
print(sample_series['Nik'])
# Returns: 33
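To make the difference between label-based and position-based access explicit, pandas provides the .loc and .iloc accessors. The sketch below reuses the same sample series; the variable names by_label, by_position, and subset are only illustrative:

```python
import pandas as pd

sample_series = pd.Series(data=[33, 32, 40, 20], index=['Nik', 'Kate', 'Jane', 'Jim'])

# .loc selects by the named index label
by_label = sample_series.loc['Kate']

# .iloc selects by integer position, like indexing a Python list
by_position = sample_series.iloc[0]

# Passing a list of labels returns a smaller series
subset = sample_series.loc[['Nik', 'Jim']]

print(by_label)
print(by_position)
print(subset)
```

Using .loc and .iloc makes it clear to a reader whether a label or a position is intended, which matters most when the index itself contains integers.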
It’s important that we pass in data and an index of the same length when creating a series. If we don’t, Pandas will raise a ValueError:
# Raising a ValueError When Creating a Series
sample_series = pd.Series(data = [33, 32, 40, 20, 33], index=['Nik', 'Kate', 'Jane', 'Jim'])
# Raises: ValueError: Length of values (5) does not match length of index (4)
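If the data and index come from somewhere you don’t control, one possible approach is to catch the error rather than letting the program stop. This is a sketch of that pattern, with error_message as an illustrative variable name:

```python
import pandas as pd

data = [33, 32, 40, 20, 33]
index = ['Nik', 'Kate', 'Jane', 'Jim']

error_message = None
try:
    # Five values but only four labels, so this raises a ValueError
    sample_series = pd.Series(data=data, index=index)
except ValueError as error:
    # Keep the message so the mismatch can be reported or logged
    error_message = str(error)

print(error_message)
```

Comparing len(data) and len(index) before calling the constructor is another way to catch the same problem early.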
Pandas Series are a key building block of the Pandas dataframe. In the next section, you’ll dive into the world of Pandas dataframes!
What is a Pandas DataFrame?
A Pandas dataframe, put simply, is a table. The dataframe contains both rows and columns, each of which is labelled. The DataFrame object contains individual records, each containing different values. Each value in a DataFrame corresponds to both a row (a record) and a column.
Why Have Two Data Structures?
In the previous section, you learned about the Pandas Series. We described the Pandas Series as being comparable to a column in a table. Similarly, we can think of a DataFrame as a container for Pandas Series objects.
Why would Pandas create two data structures, when really what we want is the tabular DataFrame? The two structures allow us to add and remove data in a dictionary-like fashion. We don’t need to worry about the size or dimensionality of the data, and we can access, modify, and retrieve data with great flexibility.
Creating a Pandas DataFrame
Let’s now take a look at how we can create a Pandas DataFrame. In the previous section, you learned to consider a Pandas DataFrame as a collection of Pandas Series objects. One of the ways in which we can create a DataFrame is by passing in a dictionary of data. Each key represents the column of our DataFrame and the key’s value represents a list of data belonging to that column.
Let’s create a small DataFrame to practice:
# Creating a DataFrame from Scratch
df = pd.DataFrame({
'Name': ['Nik', 'Kate', 'Jane', 'Evan', 'Jim', 'Moe', 'Samira'],
'Age': [33, 32, 20, 40, 22, 50, 76]
})
print(df)
# Returns:
# Name Age
# 0 Nik 33
# 1 Kate 32
# 2 Jane 20
# 3 Evan 40
# 4 Jim 22
# 5 Moe 50
# 6 Samira 76
Great work! You just created your first Pandas DataFrame. We can see here that we created a new Pandas DataFrame using a dictionary.
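Because a DataFrame behaves like a container of Series objects, its columns can be read, added, and removed in a dictionary-like way. The sketch below uses a shortened version of the sample data, and the City column is a hypothetical addition made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Nik', 'Kate', 'Jane'],
    'Age': [33, 32, 20]
})

# Selecting a single column returns a Pandas Series
age_column = df['Age']
print(type(age_column))

# Assigning to a new key adds a column (hypothetical values)
df['City'] = ['Toronto', 'Montreal', 'Vancouver']

# pop() removes a column and returns it as a Series, like dict.pop()
removed_column = df.pop('City')

print(df.columns.tolist())
```

After the pop, the DataFrame is back to its original two columns.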
Using Pandas Methods to Display Data
Now that you’ve created your first dataframe, let’s take a look at a couple of ways in which we can use Pandas to display our data. While you’ll learn significantly more about selecting data in Pandas in the subsequent tutorials, let’s start looking at just the first or last few rows of our dataset.
We can use the Pandas .head() DataFrame method to access the first five records of a DataFrame. Let’s see what this looks like:
# Returning the first five records of a Pandas DataFrame
print(df.head())
# Returns:
# Name Age
# 0 Nik 33
# 1 Kate 32
# 2 Jane 20
# 3 Evan 40
# 4 Jim 22
The .head() method also accepts an argument for the number of records to return. By default, it’ll return the first five. However, if we wanted to return only two records, we could pass the value 2 into the method:
# Returning the first two records of a Pandas DataFrame
print(df.head(2))
# Returns:
# Name Age
# 0 Nik 33
# 1 Kate 32
Similarly, we can return the last records of a dataframe using the .tail() method. Similar to the .head() method, the .tail() method will return the last five records by default. Passing in another integer value will modify this behaviour. Let’s take a look at how we can return the last three records:
# Returning the last three records of a Pandas DataFrame
print(df.tail(3))
# Returns:
# Name Age
# 4 Jim 22
# 5 Moe 50
# 6 Samira 76
In the next section, you’ll learn how to create a Pandas DataFrame from a file.
Import a Pandas Dataframe from a File
Creating a DataFrame from scratch can be quite a bit of work, and you likely already have files that you want to convert into a Pandas DataFrame. In this section, you’ll learn how to do just that!
One of the most common file formats you’ll encounter is a .csv file. These files store data as text, as shown below:
date,gender,region,sales
8/22/2022,Male,North-West,20381
3/5/2022,Male,North-East,14495
2/9/2022,Male,North-East,13510
6/22/2022,Male,North-East,15983
You can see here that the first five lines of our sample .csv file really are just text. CSV stands for comma-separated values, which makes sense, given that our values are, in fact, separated by commas. While it’s not a requirement for a comma to be the delimiter between data, it is a common convention. The first row represents the column headers and the following rows are our data.
Pandas makes reading a .csv file into a DataFrame quite easy! The library comes with an aptly-named function, .read_csv(), which allows you to read a .csv file into a dataframe. The first argument is the path to the file, which can either be a relative path, an absolute path, or even a URL. In this case, we’re going to be loading a DataFrame from a file hosted on GitHub.
Let’s load a .csv file into a DataFrame:
# Loading a csv file into a Pandas DataFrame
df = pd.read_csv('https://raw.githubusercontent.com/datagy/data/main/sales.csv')
print(df.head())
# Returns:
# date gender region sales
# 0 8/22/2022 Male North-West 20381
# 1 3/5/2022 Male North-East 14495
# 2 2/9/2022 Male North-East 13510
# 3 6/22/2022 Male North-East 15983
# 4 8/10/2022 Female North-West 15007
While the .read_csv() function takes a number of different parameters, the most important one, of course, is the path to the file.
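To give a sense of those other parameters, here is a small sketch that uses usecols to load only some columns and parse_dates to convert a column to dates. So that it runs without a network connection, it reads the sample rows shown earlier from an in-memory string through io.StringIO, which is an assumption made for this example rather than part of the original walkthrough:

```python
import io
import pandas as pd

# The sample rows from above, held in a string instead of a file
csv_text = """date,gender,region,sales
8/22/2022,Male,North-West,20381
3/5/2022,Male,North-East,14495
2/9/2022,Male,North-East,13510
"""

df = pd.read_csv(
    io.StringIO(csv_text),                # any path, URL, or file-like object works here
    usecols=['date', 'region', 'sales'],  # load only these columns
    parse_dates=['date'],                 # convert the date column to datetimes
)

print(df.dtypes)
print(df.head())
```

The same keyword arguments can be passed when the first argument is a file path or a URL.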
Reading an Excel file instead?
Pandas also provides a dedicated function for reading Excel files, both in .xls and .xlsx formats! This function is the equally aptly-named .read_excel() function. In other sections of this course, you’ll learn different ways of reading data.
Exercises
Now that you’ve loaded a dataframe from the internet, let’s see how we can explore the dataframe a little bit! Try running the code samples below and try to figure out what they do. An explanation follows each sample.
print(df.columns)
This returns a list-like structure of all the columns of your dataframe.
print(df.shape)
This returns a tuple that contains the shape of your DataFrame, in the format of (rows, columns).
len(df)
The length of your DataFrame as measured in the number of rows.
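The three checks can also be combined in one short script. The sketch below uses a small stand-in DataFrame with made-up values so that it runs on its own. Note that shape is an attribute rather than a method, so it is accessed without parentheses:

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame loaded earlier
df = pd.DataFrame({
    'Name': ['Nik', 'Kate', 'Jane'],
    'Age': [33, 32, 20]
})

# The column labels, converted to a plain Python list
column_names = df.columns.tolist()

# shape is a (rows, columns) tuple and can be unpacked directly
n_rows, n_columns = df.shape

print(column_names)
print(n_rows, n_columns)
print(len(df) == n_rows)
```

Unpacking the shape tuple is a convenient way to get the row and column counts as separate variables.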
Conclusion
In this tutorial, you dove head-first into the wonderful world of Pandas. Pandas allows you to explore tabular data from a multitude of different sources using Python! As a quick recap of what you learned:
- Pandas is not part of the standard Python library and can be installed using pip or conda
- Pandas is conventionally imported using the alias pd
- Pandas has two data structures: column-like Series and table-like DataFrames
- You can read .csv files using the pd.read_csv() function and Excel files using the pd.read_excel() function
To learn more about the Pandas DataFrame structure, check out the official pandas documentation.
In the next section, you’ll learn even more ways to explore your data with Pandas.