In this tutorial, you’ll learn how to use the Pandas melt function to reshape wide data into a long data format. If you’re working with data in Python, reshaping and transforming your data is an essential skill, and that’s where Pandas melt()
comes into play. Cleaning your data and transforming it is an essential data science and data engineering skill.
This guide offers a clear and engaging introduction to this powerful function, packed with examples that cover everything from the basics to advanced usage.
By the end of this tutorial, you’ll have learned the following:
- What the Pandas
melt()
function is and its purpose in reshaping data - How to set up and use the pandas melt function, with a step-by-step walkthrough of its syntax and parameters
- The distinction between identifier variables, value variables, and measured variables, and how to choose the right combination for your needs
- How to customize column names in the melted DataFrame using id_vars and value_vars
- Advanced melting techniques such as working with MultiIndex DataFrames and integrating melt with other pandas operations
Want to learn more about transforming your Pandas DataFrames? Check out these resources below: Data Cleaning and Preparation in Pandas and Python, Transforming Pandas Columns with map and apply, Combine Data in Pandas with merge, join, and concat.
Table of Contents
Understanding the Pandas melt() Function
The Pandas melt()
function is used to convert a wide-format DataFrame into a long-format DataFrame.
This means that we’re taking an organized, pivot-style DataFrame where multiple columns represent different variables. We are then converting it into one where data are organized into a key-value structure (with variables in one column and their values in another).
Let’s dive in and examine the syntax for the Pandas melt function.
# Understanding the Pandas melt() Function
import pandas as pd
pd.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name="value", col_level=None, ignore_index=True,)
We can see that the function offers quite a few different parameters, for which all but one have default arguments provided. The table below breaks down the parameters, their default arguments, a brief description and accepted values.
Parameter | Default Argument | Description | Accepted Values |
---|---|---|---|
frame | – | The input DataFrame to be melted | A Pandas DataFrame |
id_vars | None | Identifier variables that will remain unchanged in the resulting DataFrame | Scalar/list of Scalars/column index (integer) |
value_vars | None | Variables to transform into long format | Scalar/list of Scalars/column index (integer) |
var_name | None | Custom name for the resulting ‘variable’ column | A string representing the new column name |
value_name | “value” | Custom name for the resulting ‘value’ column | A string representing the new column name |
col_level | None | For MultiIndex DataFrames, the level of column to be melted | An integer/string/column index |
ignore_index | True | Adds a new index instead of keeping the previous index structure | Boolean: True/False |
melt()
functionBy default, when id_vars
, value_vars
, and col_level
are not provided, Pandas assumes all non-indentifier columns are value_vars and it melts the entire DataFrame.
Let’s now dive into how we can use the Pandas melt()
function by going through different examples.
Loading a Sample Pandas DataFrame
For this tutorial, we’ll be working with a simple Pandas DataFrame that we use in the Pandas pivot table tutorial. It’s been cut down a bit, in order to make understanding the function a little simpler.
Let’s take a look at how we can load it and print the first few records. The dataset contains information on different purchases of clothing types.
# Loading a Sample Pandas DataFrame
import pandas as pd
df = pd.DataFrame({'Date': ['2020-07-11','2020-09-23','2020-04-02','2020-02-28','2020-03-19'],
'Region': ['East', 'North', 'South', 'East', 'West'],
'Type': ['Children', 'Children', 'Women', 'Children', 'Women'],
'Units': [18.0, 14.0, 17.0, 26.0, 3.0],
'Sales': [306.0, 448.0, 425.0, 832.0, 33.0]})
print(df)
# Returns:
# Date Region Type Units Sales
# 0 2020-07-11 East Children 18.0 306.0
# 1 2020-09-23 North Children 14.0 448.0
# 2 2020-04-02 South Women 17.0 425.0
# 3 2020-02-28 East Children 26.0 832.0
# 4 2020-03-19 West Women 3.0 33.0
Now that you have a good sense of the data, let’s dive into using the function.
Using the Pandas melt() Function to Unpivot a DataFrame
In this section, we’ll focus on how to use the id_vars
parameter of the pandas melt
function to unpivot a DataFrame.
When it comes to melting a DataFrame, id_vars
refers to the identifier variables, i.e., the columns in your DataFrame that you want to keep unchanged, while the remaining columns are pivoted from wide to long format. These identifier variables provide context to the resulting melted DataFrame.
Let’s see how we can use Pandas to melt a DataFrame:
# Melting a DataFrame with id_vars
melted = pd.melt(
frame=df,
id_vars='Region'
)
print(melted)
# Returns:
# Region variable value
# 0 East Date 2020-07-11
# 1 North Date 2020-09-23
# 2 South Date 2020-04-02
# 3 East Date 2020-02-28
# 4 West Date 2020-03-19
# ...
# 15 East Sales 306.0
# 16 North Sales 448.0
# 17 South Sales 425.0
# 18 East Sales 832.0
# 19 West Sales 33.0
In the code block above, we specified that we wanted to use id_vars='Region'
. By specifying id_vars='Region'
, we are telling pandas to keep the ‘Region’ column unchanged while melting the rest of the columns in the DataFrame.
Similarly, the id_vars=
parameter accepts multiple column labels. Let’s explore this in the next section.
Using Multiple Identifier Variables in Pandas melt()
We can also melt a DataFrame using multiple variables. In order to use multiple columns in the Pandas melt function’s id_vars=
, we can pass in a list of column labels. This means that multiple columns will remain unchanged in the resulting DataFrame.
Let’s see how we can melt the DataFrame, keeping the Region
and Type
columns the same, by passing in a list of labels:
# Adding a Second id_vars
melted = pd.melt(
frame=df,
id_vars=['Region', 'Type']
)
print(melted)
# Returns:
# Region Type variable value
# 0 East Children Date 2020-07-11
# 1 North Children Date 2020-09-23
# 2 South Women Date 2020-04-02
# 3 East Children Date 2020-02-28
# 4 West Women Date 2020-03-19
# 5 East Children Units 18.0
# 6 North Children Units 14.0
# 7 South Women Units 17.0
# 8 East Children Units 26.0
# 9 West Women Units 3.0
# 10 East Children Sales 306.0
# 11 North Children Sales 448.0
# 12 South Women Sales 425.0
# 13 East Children Sales 832.0
# 14 West Women Sales 33.0
We can see in the resulting DataFrame above that the Region and Type columns have remained unchanged, but the other columns have been unpivoted.
Using Value Variables in the Pandas melt() Function
By default, Pandas will use all remaining columns in the value_vars=
parameter, meaning that all columns will be unpivoted. If we wanted to only melt (or unpivot) a single column, we could pass a column label into the value_vars=
parameter.
Let’s see how we can unpivot only the Sales column, broken out by the Region:
# Melting a DataFrame with value_vars
melted = pd.melt(
frame=df,
id_vars='Region',
value_vars='Sales'
)
print(melted)
# Returns:
# Region variable value
# 0 East Sales 306.0
# 1 North Sales 448.0
# 2 South Sales 425.0
# 3 East Sales 832.0
# 4 West Sales 33.0
In the example above, we kept the Region column as it was and only melted the Sales data. This means that we returned a single record for each region’s sales values.
Similarly, we can include multiple columns in the values field. Let’s see what this looks like in the next section.
Using Multiple Value Variables in Pandas melt()
By passing multiple column labels into the value_vars=
parameter of the Pandas melt()
function, you can un-pivot multiple columns. This will return key-value pairs for the two columns, broken out by the id_vars=
column(s).
Let’s take a look at what this looks like:
# Adding a Second Value Var
melted = pd.melt(
frame=df,
id_vars='Region',
value_vars=['Sales', 'Units']
)
print(melted)
# Returns:
# Region variable value
# 0 East Sales 306.0
# 1 North Sales 448.0
# 2 South Sales 425.0
# 3 East Sales 832.0
# 4 West Sales 33.0
# 5 East Units 18.0
# 6 North Units 14.0
# 7 South Units 17.0
# 8 East Units 26.0
# 9 West Units 3.0
We can see that the Region column has remained consistent. Two columns (variable and value) have been added, representing the key-value pairs for the Sales and Units columns.
In the following section, you’ll learn how to customize the names of the value and identifier variables using the Pandas melt()
function.
Customizing the Names of Value and Identifier Variables in Pandas melt()
By default, Pandas will assign the column names of variable
and value
when melting a DataFrame. We can override this behavior by using the var_name=
and value_name=
parameters.
Both of these parameters accept a single string as their input. Let’s see how we can customize these, by passing in Category and Amount as arguments.
# Naming id_vars and value_vars
melted = pd.melt(
frame=df,
id_vars='Region',
value_vars='Sales',
var_name='Category',
value_name='Amount'
)
print(melted)
# Returns:
# Region Category Amount
# 0 East Sales 306.0
# 1 North Sales 448.0
# 2 South Sales 425.0
# 3 East Sales 832.0
# 4 West Sales 33.0
In the resulting DataFrame above, we added var_name='Category
and value_name='Amount'
. This resulted in passing cleaner, more understandable names to our columns.
Retaining the Original DataFrame Index in Pandas melt()
By default, Pandas will override the former index. You can override this behavior by modifying the ignore_index=
parameter. By default, this is set to True
, but if we change it to False
, the original index will be retained. Let’s see what this looks like:
# Keeping the Original Index
melted = pd.melt(
frame=df,
id_vars='Region',
value_vars=['Sales', 'Units'],
ignore_index=False
)
print(melted)
# Returns:
# Region variable value
# 0 East Sales 306.0
# 1 North Sales 448.0
# 2 South Sales 425.0
# 3 East Sales 832.0
# 4 West Sales 33.0
# 0 East Units 18.0
# 1 North Units 14.0
# 2 South Units 17.0
# 3 East Units 26.0
# 4 West Units 3.0
In the example above, we modified the ignore_index=
parameter to False
. This means that the original index was retained (which explains why we’re seeing the indices of 0 through 4 repeated).
Using col_level in the Pandas melt Function with Multi-Index Columns
The col_level
parameter comes in handy when working with DataFrames containing MultiIndex columns. It helps you determine which level of the column index should be used when melting the DataFrame. Let’s explore this with a simple example:
# Working with col_level in Pandas melt
import pandas as pd
# Creating a sample DataFrame with MultiIndex columns
columns = pd.MultiIndex.from_tuples(
[('Sales', '2019'), ('Sales', '2020'), ('Profit', '2019'), ('Profit', '2020')],
names=['Category', 'Year']
)
data = [
[200, 250, 100, 120],
[300, 340, 150, 160],
[150, 180, 80, 90],
[250, 310, 120, 140],
]
index = ['North', 'South', 'East', 'West']
df = pd.DataFrame(data, index=index, columns=columns)
print("Original DataFrame:")
print(df)
# Melting a DataFrame with col_level
melted = pd.melt(
frame=df,
col_level='Year',
id_vars=None # In this case, we are not using id_vars
)
print(melted)
# Returns:
# Melted DataFrame:
# Year variable value
# 0 2019 NaN 200
# 1 2019 NaN 300
# 2 2019 NaN 150
# 3 2019 NaN 250
# 4 2020 NaN 250
# ...
In this example, we have a DataFrame with MultiIndex columns, where ‘Category’ (Level 0) represents ‘Sales’ and ‘Profit’, and ‘Year’ (Level 1) represents ‘2019’ and ‘2020’. When we set col_level='Year'
, the pandas melt
function uses the ‘Year’ level of the MultiIndex as the ‘variable’ column in the resulting melted DataFrame.
In summary, the col_level
parameter allows you to manage the melting process in DataFrames with MultiIndex columns, giving you more control over how your data is restructured when using the pandas melt
function.
Conclusion
Congratulations on reaching the end of this comprehensive tutorial on the Pandas melt function! We’ve covered a wide array of concepts, tools, and examples that explored the many features of this powerful data manipulation tool. To recap, here’s what we’ve learned:
- The purpose and functionality of the Pandas melt function in reshaping data from wide to long format
- How to set up and use the Pandas melt function effectively, delving into the syntax and parameters of the melt function
- Discerning the difference between identifier variables, value variables, and measured variables, and why choosing the right combination of id_vars and value_vars is critical
- Customizing column names in the melted DataFrame using id_vars, value_vars, var_name, and value_name parameters
- Advanced melting practices, such as working with MultiIndex DataFrames and integrating the melt function with other Pandas operations
With the knowledge gained from this tutorial, you are now well-equipped to tackle complex data manipulation tasks and reshape data for improved handling and analysis. Keep exploring further resources and practice your skills to become an expert in data restructuring using pandas. Remember, the key to mastering any tool is consistent practice and experimentation. Happy data melting!
Find out more about the melt function by checking out its official documentation.
Pingback: Pivot Tables in Pandas with Python for Python and Pandas • datagy