Introduction to Pandas for Data Science


In this tutorial, you’ll learn how to dive into the wonderful world of Pandas. Pandas is a Python package that provides fast and flexible data structures used for data manipulation and analysis. By the end of this tutorial, you’ll have learned how to:

  • Install pandas for Python using pip or conda
  • Understand the pandas series and dataframe objects
  • Create a dataframe from scratch, and
  • Import a dataframe from a .csv file or an .xls file

What is Pandas for Python?

Pandas is a Python package that allows you to work with tabular data and provides many helpful methods and functions to help you manipulate and analyze your data. Pandas is incredibly well suited to working with many different types of data (such as strings, integers, dates and times) in a tabular format.

It can read data from many different sources, such as the internet, Excel files, and SQL databases. Data can be ordered or unordered and can be made up of different data types. The library also handles missing data incredibly well and allows you to update, insert, and delete data using vectorized operations.
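
To make the points about missing data and vectorized operations a little more concrete, here is a minimal sketch. The ages are made up for illustration, and filling the gap with the mean is just one possible choice, not a recommendation:

```python
import pandas as pd

# Hypothetical ages; None marks a missing value
ages = pd.Series([33, None, 40, 20])

# Count the missing entries
missing_count = ages.isna().sum()

# Replace the missing entry with the mean of the other values
filled = ages.fillna(ages.mean())

# Vectorized: adds 1 to every element at once, without a loop
next_year = filled + 1

print(missing_count)
print(next_year)
```

Notice that no explicit loop is needed: the addition is applied to the whole series in one step.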

Data are stored in two main data structures, Pandas series and Pandas dataframe objects. These structures can easily be sorted, filtered, and merged with other data.
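
As a quick preview of sorting, filtering, and merging, the sketch below uses two small, made-up tables (the names, ages, and cities are illustrative assumptions). DataFrames themselves are covered in more detail later in this tutorial:

```python
import pandas as pd

# Hypothetical sample data, invented for illustration
people = pd.DataFrame({
    'Name': ['Nik', 'Kate', 'Jane'],
    'Age': [33, 32, 40],
})
cities = pd.DataFrame({
    'Name': ['Nik', 'Kate', 'Jane'],
    'City': ['Toronto', 'Ottawa', 'Montreal'],
})

# Sort the rows by the Age column (ascending by default)
sorted_people = people.sort_values('Age')

# Filter: keep only the rows where Age is greater than 32
over_32 = people[people['Age'] > 32]

# Merge the two tables on the column they share
merged = people.merge(cities, on='Name')

print(sorted_people)
print(over_32)
print(merged)
```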

All that to say, Pandas is an incredible addition to the Python environment. It’s used across a wide spectrum of domains (including finance, data science, and the sciences) and is recognized as one of the pillar libraries of Python. Because of its ease of use and wide applications for data science, you’ll learn about Pandas over the next few sections!

How Do You Install Pandas in Python?

pandas is not part of the standard Python library, so you’ll need to install it first. Thankfully, this is quite easy using either pip or conda. You can do this by writing either of these lines into your terminal or command line, depending on your preference for managing packages:

pip install pandas
# conda install pandas

As you run through the installation, you’ll notice that a lot of dependencies will also be installed. Once you have successfully installed pandas, you can import the library as you normally would import a package. Conventionally, pandas is imported with the alias of pd:

# Importing pandas
import pandas as pd

If this code runs without any errors being raised, then your installation of pandas was successful! In the following section, you’ll learn about the Pandas series data structure.
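
If you’d also like to confirm which version was installed, one simple check is to print the library’s version string. This is a small sketch; the exact value you see depends on your own installation:

```python
import pandas as pd

# pd.__version__ is a string such as "2.2.1"; yours may differ
installed_version = pd.__version__
print(f"pandas version: {installed_version}")
```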

What is a Pandas Series?

The Pandas series object is a one-dimensional data structure. You can think of this as being comparable to a column in a table. Pandas describes this as a one-dimensional, homogeneously-typed array. This means that the data are aligned along a single axis and are of the same data type.

A Pandas series is a class that, well, contains data, is indexed, and has a particular data type. Let’s see how we can create a Pandas series:

# Creating a Pandas series
import pandas as pd

# Passing dtype explicitly keeps the output below the same across pandas versions
sample_series = pd.Series(dtype='float64')
print(sample_series)

# Returns: Series([], dtype: float64)

We can see that the series holds a list-like structure, which is empty for now. We can see what this looks like in more detail by creating a series that contains data:

# Creating a Pandas series with data
sample_series = pd.Series(['Nik', 'Kate', 'Jane', 'Jim'])
print(sample_series)

# Returns: 
# 0     Nik
# 1    Kate
# 2    Jane
# 3     Jim
# dtype: object

We can see here that our series contains the data that we passed into it. In this case, we passed in a list and the data were converted into a series object. On the left, we see our index. By default, Pandas will use an index that runs from 0 through the end of the series. Let’s see how we can pass in a series index as well:

# Creating a Pandas Series with an Index
sample_series = pd.Series(data = [33, 32, 40, 20], index=['Nik', 'Kate', 'Jane', 'Jim'])
print(sample_series)

# Returns: 
# Nik     33
# Kate    32
# Jane    40
# Jim     20
# dtype: int64

In the example above, we created a Pandas series by:

  1. Calling the Series constructor
  2. Passing in the data argument, which allows us, in this case, to pass in a list of data
  3. Passing in the index argument, which allows us to pass in a list of indices

We can see that this is quite different from creating a normal Python list. Each item in our series is still accessible by its position (similar to a Python list), but now also by its named index. For example, we can now access a person’s age by simply indexing the name provided in the index:

# Accessing a Pandas series item
print(sample_series['Nik'])

# Returns: 33
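
If you want to be explicit about whether you mean a label or a position, the .loc and .iloc accessors are one way to do so. The sketch below rebuilds the same series so that it runs on its own:

```python
import pandas as pd

sample_series = pd.Series(data=[33, 32, 40, 20], index=['Nik', 'Kate', 'Jane', 'Jim'])

# .loc looks an item up by its label
by_label = sample_series.loc['Kate']

# .iloc looks an item up by its integer position (0 is the first item)
by_position = sample_series.iloc[0]

print(by_label)
print(by_position)

# Returns:
# 32
# 33
```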

It’s important that we pass in arrays of the same length when creating an index and data. If we don’t, Pandas will raise a ValueError:

# Raising a ValueError When Creating a Series
sample_series = pd.Series(data = [33, 32, 40, 20, 33], index=['Nik', 'Kate', 'Jane', 'Jim'])

# Raises: ValueError: Length of values (5) does not match length of index (4)
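
If the data and index come from somewhere you don’t control, you may prefer to catch the error rather than let the program stop. This is a minimal sketch of that idea; the variable names are just illustrative:

```python
import pandas as pd

ages = [33, 32, 40, 20, 33]            # five values
names = ['Nik', 'Kate', 'Jane', 'Jim']  # only four labels

try:
    sample_series = pd.Series(data=ages, index=names)
except ValueError as error:
    # Keep the message so we can report what went wrong
    caught_message = str(error)
    print(f"Could not create the series: {caught_message}")
```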

Pandas Series are a key building block of the Pandas dataframe. In the next section, you’ll dive into the world of Pandas dataframes!

What is a Pandas DataFrame?

A Pandas dataframe, put simply, is a table. The dataframe contains both rows and columns, each of which is labelled. The DataFrame object contains individual records, each containing different values. Each value in a DataFrame corresponds to both a row (a record) and a column.

Why Have Two Data Structures?

In the previous section, you learned about the Pandas Series. We compared the Pandas Series to a column in a table. Similarly, we can think of a DataFrame as a container for Pandas Series objects.

Why would Pandas create two data structures, when really what we want is the tabular DataFrame? Because a DataFrame is built from Series, we can add and remove columns in a dictionary-like fashion. We also don’t need to worry about the size or dimensionality of the data, which lets us access, modify, and retrieve data with great flexibility.
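
To make the relationship between the two structures more concrete, here is a small sketch. The AgeNextYear column is a made-up name used only for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Nik', 'Kate', 'Jane'],
    'Age': [33, 32, 20],
})

# Selecting a single column gives back a Series
age_column = df['Age']
print(type(age_column))

# Add a column, much like setting a key in a dictionary
df['AgeNextYear'] = df['Age'] + 1

# Remove it again, much like deleting a key from a dictionary
del df['AgeNextYear']

print(df.columns.tolist())
```

Selecting, adding, and deleting columns all use the same square-bracket syntax you would use with a Python dictionary.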

Creating a Pandas DataFrame

Let’s now take a look at how we can create a Pandas DataFrame. In the previous section, you learned to consider a Pandas DataFrame as a collection of Pandas Series objects. One of the ways in which we can create a DataFrame is by passing in a dictionary of data. Each key represents the column of our DataFrame and the key’s value represents a list of data belonging to that column.

Let’s create a small DataFrame to practice:

# Creating a DataFrame from Scratch
df = pd.DataFrame({
    'Name': ['Nik', 'Kate', 'Jane', 'Evan', 'Jim', 'Moe', 'Samira'],
    'Age': [33, 32, 20, 40, 22, 50, 76]
})

print(df)

# Returns:
#      Name  Age
# 0     Nik   33
# 1    Kate   32
# 2    Jane   20
# 3    Evan   40
# 4     Jim   22
# 5     Moe   50
# 6  Samira   76

Great work! You just created your first Pandas DataFrame. We can see here that we created a new Pandas DataFrame using a dictionary.

Using Pandas Methods to Display Data

Now that you’ve created your first dataframe, let’s take a look at a couple of ways in which we can use Pandas to display our data. While you’ll learn significantly more about selecting data in Pandas in the subsequent tutorials, let’s start looking at just the first or last few rows of our dataset.

We can use the Pandas .head() DataFrame method to access the first five records of a DataFrame. Let’s see what this looks like:

# Returning the first five records of a Pandas DataFrame
print(df.head())

# Returns:
#      Name  Age
# 0     Nik   33
# 1    Kate   32
# 2    Jane   20
# 3    Evan   40
# 4     Jim   22

The head method also allows us to insert a different argument for the number of records to return. By default, it’ll return the first five. However, say we wanted to return only two records, we could pass in the value of 2 into the method:

# Returning the first two records of a Pandas DataFrame
print(df.head(2))

# Returns:
#      Name  Age
# 0     Nik   33
# 1    Kate   32

Similarly, we can return the last records of a dataframe using the .tail() method. Similar to the .head() method, the .tail() method will return the last five records. Passing in another integer value will modify this behaviour. Let’s take a look at how we can return the last three records:

# Returning the last three records of a Pandas DataFrame
print(df.tail(3))

# Returns:
#      Name  Age
# 4     Jim   22
# 5     Moe   50
# 6  Samira   76

In the next section, you’ll learn how to create a Pandas DataFrame from a file.

Import a Pandas Dataframe from a File

Creating a DataFrame from scratch can be quite a bit of work, and you likely already have files that you want to convert into a Pandas DataFrame. In this section, you’ll learn how to do just that!

One of the most common file formats you’ll encounter is a .csv file. These files are data stored as text, such as shown below:

date,gender,region,sales
8/22/2022,Male,North-West,20381
3/5/2022,Male,North-East,14495
2/9/2022,Male,North-East,13510
6/22/2022,Male,North-East,15983

You can see here that the first five lines of our sample .csv file really are just text. CSV stands for comma-separated values, which makes sense, given that our values are, in fact, separated by commas. While it’s not a requirement for a comma to be the delimiter between data, it is a common convention. The first row represents the column headers and the following rows are our data.

Pandas makes reading a .csv file into a DataFrame quite easy! The library comes with an aptly-named function, .read_csv(), which allows you to read a .csv file into a dataframe. The first argument is the path to the file, which can be a relative path, an absolute path, or even a URL. In this case, we’re going to be loading a DataFrame from a file hosted on GitHub.

Let’s load a .csv file into a DataFrame:

# Loading a csv file into a Pandas DataFrame
df = pd.read_csv('https://raw.githubusercontent.com/datagy/data/main/sales.csv')
print(df.head())

# Returns:
#         date  gender      region  sales
# 0  8/22/2022    Male  North-West  20381
# 1   3/5/2022    Male  North-East  14495
# 2   2/9/2022    Male  North-East  13510
# 3  6/22/2022    Male  North-East  15983
# 4  8/10/2022  Female  North-West  15007

While the .read_csv() function takes a number of different parameters, the most important one, of course, is the path to the file.
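
A few of those other parameters are sketched below. To keep the example self-contained, it reads from an in-memory string instead of a file, and the semicolon-delimited text is a hypothetical variation on the sample rows shown earlier:

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited version of the sample rows
raw_text = """date;gender;region;sales
8/22/2022;Male;North-West;20381
3/5/2022;Male;North-East;14495
2/9/2022;Male;North-East;13510
"""

# io.StringIO lets read_csv treat an in-memory string like a file
df_sales = pd.read_csv(
    io.StringIO(raw_text),
    sep=';',                              # use a delimiter other than a comma
    usecols=['date', 'region', 'sales'],  # load only these columns
    parse_dates=['date'],                 # convert the date column to datetimes
)

print(df_sales.dtypes)
```

With a real file, you would pass its path or URL as the first argument in place of io.StringIO(raw_text).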

Pandas also provides a dedicated function for reading Excel files, both in .xls and .xlsx formats! This is the equally aptly-named .read_excel() function. In other sections of this course, you’ll learn different ways of reading data.

Exercises

Now that you’ve loaded a dataframe from the internet, let’s see how we can explore the dataframe a little bit! Try running the code samples below and try to figure out what they do. The solutions are provided below.

This returns a list-like structure of all the columns of your dataframe.

This returns a tuple that contains the shape of your DataFrame, in the format of (rows, columns).

The length of your DataFrame as measured in the number of rows.
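
Expressions that match the three descriptions above, in order, might look like the following sketch. It assumes a small stand-in DataFrame built from the sample rows shown earlier, so that it runs without a download:

```python
import pandas as pd

# A small stand-in built from the sample rows, so no download is needed
df = pd.DataFrame({
    'date': ['8/22/2022', '3/5/2022', '2/9/2022', '6/22/2022'],
    'gender': ['Male', 'Male', 'Male', 'Male'],
    'region': ['North-West', 'North-East', 'North-East', 'North-East'],
    'sales': [20381, 14495, 13510, 15983],
})

print(df.columns)   # list-like Index holding the column names
print(df.shape)     # tuple in the format (rows, columns)
print(len(df))      # number of rows
```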

Conclusion

In this tutorial, you dove head-first into the wonderful world of Pandas. Pandas allows you to explore tabular data from a multitude of different sources using Python! As a quick recap of what you learned:

  • Pandas is not part of the standard Python library and can be installed using pip or conda
  • Pandas is conventionally imported using the alias pd
  • Pandas has two data structures: column-like Series and table-like DataFrames
  • You can read .csv files using the pd.read_csv() function and Excel files using the pd.read_excel() function

To learn more about the Pandas DataFrame structure, check out the official pandas documentation.

In the next section, you’ll learn even more ways to explore your data with Pandas.
