
Summarizing and Analyzing a Pandas DataFrame


In this tutorial, you’ll learn how to quickly summarize and analyze a Pandas DataFrame. By the end of this tutorial, you’ll have learned to take on some exploratory analysis of your dataset using pandas. You’ll learn how to calculate general attributes of your dataset, such as measures of central tendency or measures of dispersion. You’ll also learn how to count unique values and how to easily get summary statistics for your entire dataframe.

You’ll learn how to take on exploratory data analysis (or EDA), which is a critical first step in taking on any form of data analysis or machine learning. This process allows you to spot patterns and anomalies in your data. This allows you to build assumptions and start building tests to verify them. Learning these skills will be crucial for many other tasks in manipulating and working with data. Let’s get started!

Loading a Sample Pandas DataFrame

Let’s load a sample Pandas DataFrame that we’ll use throughout the tutorial. The dataset is hosted on our Github page and can be loaded by copying and pasting the code below. Let’s also print out the first five records of our DataFrame using the .head() method.

# Loading our Pandas DataFrame
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/datagy/data/main/sales.csv')
print(df.head())

# Returns:
#         date  gender      region  sales
# 0  8/22/2022    Male  North-West  20381
# 1   3/5/2022    Male  North-East  14495
# 2   2/9/2022    Male  North-East  13510
# 3  6/22/2022    Male  North-East  15983
# 4  8/10/2022  Female  North-West  15007

We can see here that we have four different columns:

  1. A date column that holds the date of a transaction
  2. gender and region columns that contain categorical variables
  3. A sales column that holds the amount of a sale

Let’s dive into doing some exploratory data analysis on our DataFrame!

Pandas Summary Functions

Pandas provides a multitude of summary functions to help us get a better sense of our dataset. These functions are smart enough to figure out whether we are applying them to a Series or a DataFrame. When applied to a DataFrame, Pandas automatically computes the summary column by column where appropriate.

In the following sections, you’ll explore only a handful of the key methods that are available to you. You’ll learn how to find the average of a column, the standard deviation and skew, as well as add up a column and get helpful summary statistics in one go.

Finding the Average of a Pandas DataFrame

Let’s start off with a simple calculation: calculating the mean (or average) of a Pandas DataFrame. Pandas provides a helpful method for this, the .mean() method. We can apply this method to a single column or to multiple columns. Let’s see how this is done:

# Calculate the average for a single column
print(df['sales'].mean())

# Returns: 19044.489

We can see here that applying the .mean() method to a single column returns a scalar value. This means we can grab that value directly, assign it to a variable, and use it elsewhere. What would happen if we applied the method to an entire dataframe? Pandas asks us to state explicitly that we only want to calculate the mean of numeric columns, by passing the numeric_only=True parameter.

# Calculate the average for an entire dataframe
print(df.mean(numeric_only=True))

# Returns: 
# sales    19044.489
# dtype: float64

This actually returns a pandas Series – meaning that we can index out the values that we’re interested in. For example, if we were to print out df.mean(numeric_only=True)['sales'], the same value would be returned as though we had calculated the mean only for the one column.

By default, Pandas excludes missing values when calculating the mean. We can change this behavior by passing skipna=False as a parameter, in which case the result will be NaN whenever any missing values are present.
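As a quick illustration of the skipna behavior (using a small, made-up Series rather than our sales data):

```python
import numpy as np
import pandas as pd

# A tiny Series with one missing value
s = pd.Series([10, 20, np.nan, 30])

# By default, missing values are skipped
print(s.mean())
# Returns: 20.0

# With skipna=False, a single missing value makes the result NaN
print(s.mean(skipna=False))
# Returns: nan
```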

Finding the Standard Deviation of a Pandas DataFrame

Pandas also provides a helpful method for calculating the standard deviation. The standard deviation is a helpful measure in determining how spread out a dataset is. For example, a small standard deviation implies that the data are clustered closely together. Conversely, a large standard deviation implies that the data have a much larger spread.

In order to calculate the standard deviation using Pandas, we use the .std() method. Similar to the .mean() method, we can apply this method to a single column, to multiple columns, or to an entire DataFrame. This is one of the perks of using Pandas – many of the methods follow a similar convention and accept the same parameters. Let’s apply the method to the entire dataframe, restricting it to numeric columns (missing values are skipped by default):

# Calculating the standard deviation for an entire dataframe
print(df.std(numeric_only=True))

# Returns: 
# sales    5484.674161
# dtype: float64

Similar to our previous example, this method returns a Pandas Series when applied to more than one column.
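One detail worth knowing, sketched here with a small, made-up Series: pandas’ .std() uses the sample standard deviation (ddof=1) by default, unlike NumPy’s np.std(), which defaults to the population version (ddof=0).

```python
import pandas as pd

# A small, made-up dataset
s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

# Sample standard deviation (pandas default, ddof=1)
print(s.std())

# Population standard deviation (ddof=0)
print(s.std(ddof=0))
# Returns: 2.0
```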

Finding the Skew of a Pandas DataFrame

Skewness measures the asymmetry of a distribution around its mean. A skewness value can be either positive or negative, depending on the directionality of the skew. The table below breaks down some common skewness ranges:

Skewness Value    Direction    Degree of Skewness
< -1              Left         High
-1 to -0.5        Left         Moderate
-0.5 to 0         Left         Approximately symmetric
0                 N/A          Completely symmetric
0 to 0.5          Right        Approximately symmetric
0.5 to 1          Right        Moderate
> 1               Right        High

Providing an overview of relative skew values

We can measure the skewness of a distribution using the .skew() method. Similar to the examples above, we can calculate the skewness of a single column, multiple columns, or an entire DataFrame using the .skew() method. Let’s see what happens when we calculate it for only the numeric columns:

# Calculating skewness using .skew()
print(df.skew(numeric_only=True))

# Returns: 
# sales    0.827105
# dtype: float64

In the example above, the sales data has a moderate positive skew, meaning the distribution has a longer tail to the right.
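To see the sign convention in action, here is a quick sketch using a made-up dataset whose long tail sits on the right:

```python
import pandas as pd

# One large outlier drags the tail out to the right
s = pd.Series([1, 2, 2, 3, 10])

# A positive value confirms the distribution is right-skewed
print(s.skew())
```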

Reading the Pandas Documentation

Before we keep going in terms of providing different ways to explore and analyze your dataset, let’s take a look at reading the Pandas documentation. Why would this be important? Being able to read the documentation can seem a little daunting at first, but ends up being an invaluable resource.

Take a look at the documentation for the .mean() method, which you can find on the official pandas website. Below, you’ll find a picture of the documentation, which we’ll use to gain an understanding of how to read it:

The documentation for the Pandas .mean() method

There are four main sections to the pandas documentation:

  1. Method Name: we can see here, for example, that we’re looking at the DataFrame method (rather than the Series method)
  2. Description: this provides a plain English description of what the method does
  3. Parameters: the different parameters the method takes and how to work with them. We can see, for example, that Pandas will, by default, skip over missing data.
  4. Returns: what the method returns (i.e., what to expect)

Now it’s your turn! Try and search for a method that will add up the values of a column. Then, in your code editor, try and write a line of code to add up all the values in the sales column. Click the section below for the solution:

Since you’re hoping to add up all the values in a given column or across the entire DataFrame, we know that we’re hoping to calculate the sum. When we search for sum, a number of different items are returned, including the pandas.DataFrame.sum page. Here, we can see that we can simply apply the method to either the DataFrame or to the column.

# Adding up all the values in the sales column
print(df['sales'].sum())

# Returns: 19044489
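As with the other methods, .sum() can also be applied DataFrame-wide with numeric_only=True; here is a quick sketch using a small, made-up DataFrame rather than our sales data:

```python
import pandas as pd

# A tiny stand-in for our sales data
demo = pd.DataFrame({'region': ['South', 'North'], 'sales': [100, 200]})

# Sum every numeric column at once; this returns a Series
print(demo.sum(numeric_only=True))
# sales    300
```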

Get Summary Statistics with Pandas describe

In the previous sections, you learned how to calculate individual statistics, such as the mean or the standard deviation. While this approach works, there will be a lot of times where you’ll want to just get an overview of the dataset. This is where the Pandas .describe() method comes into play.

The method provides a number of helpful statistics, such as the mean, standard deviation and quartiles of the data. Let’s try and use this method to see what’s returned:

# Using the Pandas .describe() method
print(df.describe())

# Returns:
#               sales
# count   1000.000000
# mean   19044.489000
# std     5484.674161
# min     6084.000000
# 25%    15628.000000
# 50%    17983.500000
# 75%    21612.500000
# max    43775.000000

The .describe() method returns a treasure trove of information about numeric columns in our DataFrame. When the method is applied to a single column, a Series is returned. Meanwhile, when it’s applied to multiple columns, a DataFrame is returned.
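If you also want the non-numeric columns summarized, .describe() accepts an include= parameter. Here is a quick sketch with a small, made-up DataFrame rather than our sales data:

```python
import pandas as pd

# A tiny stand-in for our sales data
demo = pd.DataFrame({
    'region': ['South', 'South', 'North-East'],
    'sales': [100, 200, 300],
})

# include='all' adds count, unique, top, and freq rows
# for the non-numeric columns alongside the numeric stats
print(demo.describe(include='all'))
```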

Finding Unique Values in a Pandas DataFrame

Another common operation you’ll want to take on is identifying the different unique values that exist in a given column. Because data entry is often imperfect, examining unique values is a good way to assess the quality of a dataset. It also gives us a sense of data completeness and flags any data integrity issues we may encounter.

Let’s take a look at an example: we may want to see what the unique values are in the region column. For this, we can use the .unique() method.

# Getting Unique Values in the region Column
print(df['region'].unique())

# Returns: ['North-West' 'North-East' 'South']

We can see that the method returns a NumPy array of the unique values in our column. There are many more nuances to the method, but being able to get unique values is a great first step.
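Two closely related helpers are worth a mention; a quick sketch with a small, made-up column rather than our sales data:

```python
import pandas as pd

# A tiny stand-in for our region column
region = pd.Series(['South', 'South', 'North-East', 'South', 'North-East'])

# .value_counts() tallies how often each unique value appears
print(region.value_counts())
# South         3
# North-East    2

# .nunique() counts the distinct values directly
print(region.nunique())
# Returns: 2
```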

Creating Crosstabs of a Pandas DataFrame

In this final section, we’ll take a look at creating a crosstab of our dataset. A crosstab shows the relationship between two or more categorical variables by showing the number of records that fall into the cross-section of the two.

This can be helpful to illustrate how the data is distributed across different categorical or numerical variables.

# Creating Your First Crosstab
tab = pd.crosstab(
    index=df['region'],
    columns=df['gender'],
)

print(tab)

# Returns:
# gender      Female  Male
# region                  
# North-East     177   170
# North-West     161   161
# South          167   164

We can see that by default Pandas will provide the counts across these different categories. However, we can also modify this behavior to provide a different aggregation.

For example, we can pass in both a values= parameter and an aggfunc= parameter to override the normal behavior. Let’s, for example, look at the same two categories but provide a sum of all the sales:

# Adding Values Using a Pandas Crosstab
tab = pd.crosstab(
    index=df['region'],
    columns=df['gender'],
    values=df['sales'],
    aggfunc='sum'
)

print(tab)

# Returns:
# gender       Female     Male
# region                      
# North-East  3051132  2981835
# North-West  2455899  2457091
# South       4135688  3962844

We can see how useful this type of preliminary analysis can be! For example, we can identify that sales in the South region are significantly higher, while sales in the North-West region are the lowest.
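Two more crosstab options are worth knowing, sketched here with a tiny made-up DataFrame rather than our sales file: margins=True appends row and column totals, and normalize='index' converts counts into row-wise proportions.

```python
import pandas as pd

demo = pd.DataFrame({
    'region': ['South', 'South', 'North'],
    'gender': ['Female', 'Male', 'Female'],
})

# margins=True adds an 'All' row and column holding the totals
totals = pd.crosstab(index=demo['region'], columns=demo['gender'], margins=True)
print(totals)

# normalize='index' expresses each cell as a share of its row total
shares = pd.crosstab(index=demo['region'], columns=demo['gender'], normalize='index')
print(shares)
```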

Exercises

It’s time to check your learning! Take a look at the exercises below and try and solve them. The solutions are all available by clicking the toggle button along with each question.

While searching through the pandas documentation, you might come across the .max() method, which returns the maximum value for a given series. You can apply the method to the sales column, as shown below:

# Finding the highest sale value
print(df['sales'].max())
# Returns: 43775

Use the .describe() method to calculate some high-level statistics and then access the ['75%'] value:

# Access the 75th percentile of the sales column
print(df['sales'].describe()['75%'])
# Returns: 21612.5

One way that you can find this out is by using a crosstab and finding the intersection between Female and South:

# Determining the number of sales made by women in South region
print(pd.crosstab(
    index=df['region'],
    columns=df['gender']
).loc['South', 'Female'])

# Returns: 167

Because the .unique() method returns a NumPy array, we can pass it directly into the len() function to count the number of unique values.

# Count the number of unique items in the region column
print(len(df['region'].unique()))

# Returns: 3

Conclusion and Recap

In this tutorial, you learned how to explore your data using a Pandas DataFrame. You learned how to use some methods to calculate statistics, as well as how to use the .describe() method to generate insight into the DataFrame. You also learned how to use the crosstab() function to calculate frequencies and other valuable insights. Finally, you learned how to use the Pandas documentation to find and understand other helpful methods and functions.

