In this tutorial, you’ll learn how to quickly summarize and analyze a Pandas DataFrame. By the end of this tutorial, you’ll have learned to take on some exploratory analysis of your dataset using pandas. You’ll learn how to calculate general attributes of your dataset, such as measures of central tendency or measures of dispersion. You’ll also learn how to count unique values and how to easily get summary statistics for your entire dataframe.
You’ll learn how to take on exploratory data analysis (or EDA), which is a critical first step in taking on any form of data analysis or machine learning. This process allows you to spot patterns and anomalies in your data. This allows you to build assumptions and start building tests to verify them. Learning these skills will be crucial for many other tasks in manipulating and working with data. Let’s get started!
Loading a Sample Pandas DataFrame
Let’s load a sample Pandas DataFrame that we use throughout the tutorial. The tutorial is hosted on our Github page and can be loaded by copying and pasting the code below. Let’s also print out the first five records of our DataFrame using the
# Loading our Pandas DataFrame df = pd.read_csv('https://raw.githubusercontent.com/datagy/data/main/sales.csv') print(df.head()) # Returns: # date gender region sales # 0 8/22/2022 Male North-West 20381 # 1 3/5/2022 Male North-East 14495 # 2 2/9/2022 Male North-East 13510 # 3 6/22/2022 Male North-East 15983 # 4 8/10/2022 Female North-West 15007
We can see here that we have four different columns:
datecolumn that holds the date of a transaction
regioncolumns that contain categorical variables
salescolumn that holds the amount of a sale
Let’s dive into doing some exploratory data analysis on our DataFrame!
Pandas Summary Functions
Pandas provides a multitude of summary functions to help us get a better sense of our dataset. These functions are smart enough to figure out whether we are applying these functions to a Series or a DataFrame. Pandas will automatically broadcast a summary method when it’s appropriate to do so.
In the following sections, you’ll explore only a handful of the key methods that are available to you. You’ll learn how to find the average of a column, the standard deviation and skew, as well as add up a column and get helpful summary statistics in one go.
Finding the Average of a Pandas DataFrame
Let’s start off with a simple calculation: calculating the mean (or average) of a Pandas DataFrame. Pandas provides a helpful method for this, the
.mean() method. We can apply this method to a single column or to multiple columns. Let’s see how this is done:
# Calculate the average for a single column print(df['sales'].mean()) # Returns: 19044.489
We can see here that applying the
.mean() method to a single column returns a scalar value. This means we can easily grab that value directly and assign it to a variable and use it in other places. What would happen if we applied the method to an entire dataframe? Pandas encourages us to identify that we only want to calculate the mean of numeric columns, by using the
numeric_only = True parameter.
# Calculate the average for an entire dataframe print(df.mean(numeric_only=True)) # Returns: # sales 19044.489 # dtype: float64
This actually returns a pandas Series – meaning that we can index out the values that we’re interested in. For example, if we were to print out
df.mean(numeric_only=True)['sales'], the same value would be returned as though we had calculated the mean only for the one column.
By default, Pandas will ignore missing values from being included in calculating the mean. However, we can opt to include them in our calculation by including
skipna=False as a parameter.
Finding the Standard Deviation of a Pandas DataFrame
Pandas also provide a helpful method for calculating the standard deviation. The standard deviation is a helpful measure in determining how spread out a dataset is. For example, a small standard deviation implies that the data are clustered closely together. Inversely, a large standard deviation implies that the data have a much larger spread.
In order to calculate the standard deviation using Pandas, we use the
.std() method. Similar to the
.mean() method, we can apply this method to a single column, to multiple columns, or to an entire DataFrame. This is one of the perks of using Pandas – many of the methods use a similar convention and allow you to pass in the same parameters. Let’s apply the method to the entire dataframe, asking it skip missing values and to only include numeric columns:
# Calculating the standard deviation for an entire dataframe print(df.std(numeric_only=True)) # Returns: # sales 5484.674161 # dtype: float64
Similar to our previous example, this method returns a Pandas series when applied to more than one column.
Finding the Skew of a Pandas DataFrame
Skewness measures the asymmetry of a normal distribution away from the distribution’s mean. A skewness value can be either positive or negative, depending on the directionality of the skew. The table below breaks down some common skewness ranges:
|Skewness Value||Direction||Degree of Skewness|
|> -1 and < -0.5||Left||Moderate|
|> -0.5 and < 0||Left||Approximately symmetric|
|> 0 and < 0.5||Right||Approximately symmetric|
|> 0.5 and < 1||Right||Moderate|
We can measure the skewness of a distraction using the
.skew() method. Similar to the examples above, we can calculate the skewness of a single column, multiple columns, and an entire DataFrame using the
.skew() method. Let’s see what happens when we calculate it for only the numeric columns:
# Calculating skewness using .skew() print(df.skew(numeric_only=True)) # Returns: # sales 0.827105 # dtype: float64
In the example above, the
sales data has a moderate positive skew, meaning that the data is right-leaning.
Reading the Pandas Documentation
Before we keep going in terms of providing different ways to explore and analyze your dataset, let’s take a look at reading the Pandas documentation. Why would this be important? Being able to read the documentation can seem a little daunting at first, but ends up being an invaluable resource.
Take a look at the documentation for the
.mean() method, which you can find here. Below, you’ll find a picture of the documentation, which we’ll use to gain an understanding of how to use them:
There are four main sections to the pandas documentation:
- Method Name: we can see here, for example that we’re looking at the DataFrame method (rather than the Series) method
- Description: this provides a plain English description of what the method does
- Parameters: the different parameters the method takes and how to work with them. We can see, for example, that Pandas will, by default, skip over missing data.
- Returns: what the method returns (i.e., what to expect)
Now it’s your turn! Try and search for a method that will add up the values of a column. Then, in your code editor, try and write a line of code add up all the values in the
sales column. Click the section below for the solution:
Get Summary Statistics with Pandas describe
In the previous sections, you learned how to calculate individual statistics, such as the mean or the standard deviation. While this approach works, there will be a lot of times where you’ll want to just get an overview of the dataset. This is where the Pandas
.describe() method comes into play.
The method provides a number of helpful statistics, such as the mean, standard deviation and quartiles of the data. Let’s try and use this method to see what’s returned:
# Using the Pandas .describe() method print(df.describe()) # Returns: # sales # count 1000.000000 # mean 19044.489000 # std 5484.674161 # min 6084.000000 # 25% 15628.000000 # 50% 17983.500000 # 75% 21612.500000 # max 43775.000000
.describe() method returns a treasure trove of information about numeric columns in our DataFrame. When the method is applied to a single column, a Series is returned. Meanwhile, when it’s applied to multiple columns, a DataFrame is returned.
Finding Unique Values in a Pandas DataFrame
Another common operation you’ll want to take on is identifying the different unique values that exist in a given column. Because data entry is often imperfect, we can use finding unique values as a way to suss out data quality of a dataset. It also gives us a sense of data completeness and any data integrity issues we may encounter.
Let’s take a look at an example: we may want to be able to see what the unique values are in the
region column For this, we can use
# Getting Unique Values in the region Column print(df['region'].unique()) # Returns: ['North-West' 'North-East' 'South']
We can see that the method returns a list of the unique values in our column. There are a lot more nuances about the method, but being able to understand how to get unique values is a great first step.
Creating Crosstabs of a Pandas DataFrame
In this final section, we’ll take a look at creating a crosstab of our dataset. A crosstab shows the relationship between two or more categorical variables by showing the number of records that fall into the cross-section of the two.
This can be helpful to illustrate how the data is distributed across different categorical or numerical variables.
# Creating Your First Crosstab tab = pd.crosstab( index=df['region'], columns=df['gender'], ) print(tab) # Returns: # gender Female Male # region # North-East 177 170 # North-West 161 161 # South 167 164
We can see that by default Pandas will provide the counts across these different categories. However, we can also modify this behavior to provide a different aggregation.
For example, we can pass in both a
values= parameter and an
aggfunc= parameter to override the normal behavior. Let’s, for example, look at the same two categories but provide a sum of all the sales:
# Adding Values Using a Pandas Crosstab tab = pd.crosstab( index=df['region'], columns=df['gender'], values=df['sales'], aggfunc='sum' ) print(tab) # Returns: # gender Female Male # region # North-East 3051132 2981835 # North-West 2455899 2457091 # South 4135688 3962844
We can see how useful this type of preliminary analysis can be! For example, we can identify that sales in the South region are significantly higher, while sales in the North-West region are the lowest.
It’s time to check your learning! Take a look at the exercises below and try and solve them. The solutions are all available by clicking the toggle button along with each question.
Conclusion and Recap
In this tutorial, you learned how to explore your data using a Pandas DataFrame. You learned how to use some methods to calculate some statistics as well as how to use the
.describe() method to generate some insight into the DataFrame. You also learned how to use the
.crosstab() method to calculate frequencies and other valuable insights. Finally, you also learned how to use the Pandas documentation to find and understand other helpful methods and functions.
To learn about related topics, check out the articles below: