Skip to content

Selecting Columns in Pandas: Complete Guide

Pandas Select Columns Complete Guide Cover Image

In this tutorial, you’ll learn how to select all the different ways you can select columns in Pandas, either by name or index. You’ll learn how to use the lociloc accessors and how to select columns directly. You’ll also learn how to select columns conditionally, such as those containing a specific substring.

By the end of this tutorial, you’ll have learned:

  • How to select columns by name or by index
  • How to select all columns except for named columns
  • How to select columns of a specific datatype
  • How the .loc and .iloc accessors work to select data in Pandas
  • How to select columns conditionally, such as those containing a string

Let’s get started!

Loading a Sample Pandas DataFrame

To follow along with this tutorial, let’s load a sample Pandas DataFrame. By copying the code below, you’ll load a dataset that’s hosted on my Github page.

The DataFrame contains a number of columns of different data types, but few rows. This allows us to print out the entire DataFrame, ensuring us to follow along with exactly what’s going on.

import pandas as pd

df = pd.read_csv('')
print(df.head())

# Returns:
#       Name  Age Height  Score  Random_A  Random_B  Random_C  Random_D  Random_E
# 0      Joe   28    5'9     30        73        59         5         4        31
# 1  Melissa   26    5'5     32        30        85        38        32        80
# 2      Nik   31   5'11     34        80        71        59        71        53
# 3   Andrea   33    5'6     38        16        63        86        81        42
# 4     Jane   32    5'8     29        19        40        48         5        68

Let’s break down what we’ve done here:

  1. We imported pandas using the alias pd
  2. We then loaded a DataFrame using the pd.read_csv() function.

What Makes Up a Pandas DataFrame

Before diving into how to select columns in a Pandas DataFrame, let’s take a look at what makes up a DataFrame.

A DataFrame has both rows and columns. Each of the columns has a name and an index. For example, the column with the name 'Age' has the index position of 1.

As with other indexed objects in Python, we can also access columns using their negative index. For example, the column with the name 'Random_C' has the index position of -1.

What is Pandas Dataframe

Why Select Columns in Python?

The data you work with in lots of tutorials has very clean data with a limited number of columns. But this isn’t true all the time.

In many cases, you’ll run into datasets that have many columns – most of which are not needed for your analysis. In this case, you’ll want to select out a number of columns.

This often has the added benefit of using less memory on your computer (when removing columns you don’t need), as well as reducing the amount of columns you need to keep track of mentally.

How to Select a Single Column in Pandas

Pandas makes it easy to select a single column, using its name. We can do this in two different ways:

  1. Using dot notation to access the column
  2. Using square-brackets to access the column

Let’s see how we can do this by accessing the 'Name' column:

# Accessing a Single Column by Its Name
print(df['Name'])
# print(df.Name) (This returns the same value)

# Returns:
# 0        Joe
# 1    Melissa
# 2        Nik
# 3     Andrea
# 4       Jane
# Name: Name, dtype: object

Let’s take a quick look at why using the dot operator is often not recommended (while it’s easier to type). This is because you can’t:

  1. Select columns with spaces in the name,
  2. Use columns that have the same names as dataframe methods (such as ‘type’),
  3. Pick columns that aren’t strings, and
  4. Select multiple columns (as you’ll see later)

Now let’s take a look at what this actually returns. We can do this by using the type() function:

# Checking the Type of Our Column
print(type(df['Name']))

# Returns:
# <class 'pandas.core.series.Series'>

We can see that selecting a single column returns a Pandas Series.

If we wanted to return a Pandas DataFrame instead, we could use double square-brackets to make our selection. Let’s see what this looks like:

# Selecting a Single Column as a Pandas DataFrame
print(type(df[['Name']]))

# Returns:
# <class 'pandas.core.frame.DataFrame'>

What we’re actually doing here is passing in a list of columns to select. In this case, we’re passing in a list with a single item. When a list is passed in to the selector, a DataFrame is returned.

In the following section, you’ll learn how to select multiple columns in a Pandas DataFrame.

How to Select Multiple Columns in Pandas

Selecting multiple columns works in a very similar way to selecting a single column. We can pass a list of column names into our selection in order to select multiple columns. Because we need to pass in a list of items, the . notation is not possible when selecting multiple columns.

Let’s take a look at how we can select the the Name, Age, and Height columns:

# Selecting Multiple Columns in a Pandas DataFrame
selection = df[['Name', 'Age', 'Height']]
print(selection)

# Returns:
#       Name  Age Height
# 0      Joe   28    5'9
# 1  Melissa   26    5'5
# 2      Nik   31   5'11
# 3   Andrea   33    5'6
# 4     Jane   32    5'8

What’s great about this method, is that you can return columns in whatever order you want. If you wanted to switch the order around, you could just change it in your list:

# Selecting Columns in a Different Order
selection = df[['Name', 'Height', 'Age']]
print(selection)

# Returns:
#       Name Height  Age
# 0      Joe    5'9   28
# 1  Melissa    5'5   26
# 2      Nik   5'11   31
# 3   Andrea    5'6   33
# 4     Jane    5'8   32

In the next section, you’ll learn how to select columns by data type in Pandas.

How to Select Columns by Data Type in Pandas

In this section, you’ll learn how to select Pandas columns by specifying a data type in Pandas. This method allows you to, for example, select all numeric columns. This can be done by using the, aptly-named, .select_dtypes() method.

Let’s take a look at how we can select only text columns, which are stored as the 'object' data type:

# Selecting a Data Type with Pandas
selection = df.select_dtypes('object')
print(selection)

# Returns:
#       Name Height
# 0      Joe    5'9
# 1  Melissa    5'5
# 2      Nik   5'11
# 3   Andrea    5'6
# 4     Jane    5'8

The data types that are available can be a bit convoluted. The list below breaks down some of the common ones you may encounter:

  • Numeric data types:  np.number or 'number'
  • Strings: 'object'
  • Datetimes: np.datetime64'datetime' or 'datetime64'
  • Timedeltas: np.timedelta64'timedelta' or 'timedelta64'
  • Categorical: 'category'
  • Datetimes with Timezones: 'datetimetz' or 'datetime64[ns, tz]'

Using loc to Select Columns

The .loc accessor is a great way to select a single column or multiple columns in a dataframe if you know the column name(s).

This method is great for:

  • Selecting columns by column name,
  • Selecting columns using a single label, a list of labels, or a slice

The loc method looks like this:

Pandas Select Columns with loc

In the image above, you can see that you need to provide some list of rows to select. In many cases, you’ll want to select all rows (if your intention is to select columns). Because of this, you can pass in simply a colon (:), which selects all rows.

Let’s see how we can select all rows belonging to the name column, using the .loc accessor:

# Using .loc to Select Columns
print(df.loc[:, 'Name'])

# Returns:
# 0        Joe
# 1    Melissa
# 2        Nik
# 3     Andrea
# 4       Jane
# Name: Name, dtype: object

Now, if you wanted to select only the name column and the first three rows, you could write:

# Using .loc to Select Rows and Columns
selection = df.loc[:2,'Name']
print(selection)

# Returns:
# 0        Joe
# 1    Melissa
# 2        Nik

Selecting Multiple Columns with .loc in Pandas

Similarly, Pandas makes it easy to select multiple columns using the .loc accessor. We can include a list of columns to select. For example, if we wanted to select the 'Name' and 'Height' columns, we could pass in the list ['Name', 'Height'] as shown below:

# Selecting Multiple Columns Using .loc
print(df.loc[:, ['Name', 'Height']])

# Returns:
#       Name Height
# 0      Joe    5'9
# 1  Melissa    5'5
# 2      Nik   5'11
# 3   Andrea    5'6
# 4     Jane    5'8

We can also select a slice of columns using the .loc accessor. This works a little differently, as we don’t need to pass in a list, but rather a slice of column names. Say we wanted to select all columns from the 'Name' to 'Score' columns, we could write:

# Selecting a Slice of Columns in Pandas
print(df.loc[:, 'Name':'Score'])

# Returns:
#       Name  Age Height  Score
# 0      Joe   28    5'9     30
# 1  Melissa   26    5'5     32
# 2      Nik   31   5'11     34
# 3   Andrea   33    5'6     38
# 4     Jane   32    5'8     29

As a quick recap, the .loc accessor is great for selecting columns and rows by their names. In the following section, you’ll learn about the .iloc accessor, which lets you access rows and columns by their index position.

Using iloc to Select Columns

The iloc function is one of the primary way of selecting data in Pandas. The method “iloc” stands for integer location indexing, where rows and columns are selected using their integer positions.

This method is great for:

  • Selecting columns by column position (index),
  • Selecting columns using a single position, a list of positions, or a slice of positions

The standard format of the iloc method looks like this:

Pandas Select Columns with iloc

Now, for example, if we wanted to select the first two columns of our dataframe, we could write:

# Selecting a Range of Columns with .iloc
print(df.iloc[ : , :2])

# Returns:
#       Name  Age
# 0      Joe   28
# 1  Melissa   26
# 2      Nik   31
# 3   Andrea   33
# 4     Jane   32

Note that we didn’t write df.iloc[:,0:2], but that would have yielded the same result.

If we wanted to select all columns and only two rows with .iloc, we could do that by writing:

selection = df.iloc[:2,]
print(selection)

This returns:

      Name  Age Height  Score  Random_A  Random_B  Random_C  Random_D  Random_E
0      Joe   28    5'9     30        73        59         5         4        31
1  Melissa   26    5'5     32        30        85        38        32        80

How to Select Column Names Containing a String in Pandas

There may be times when you want to select columns that contain a certain string. This can, for example, be helpful if you’re looking for columns containing a particular unit. In our example below, we’re selecting columns that contain the string 'Random'.

Since the .loc accessor can accept a list of columns, we can write a list comprehension in the accessor to filter out column names meeting our condition. Let’s see what this looks like:

# Selecting Columns Meeting a Condition
print(df.loc[ : , [col for col in df.columns if 'Random' in col]])

# Returns:
#    Random_A  Random_B  Random_C  Random_D  Random_E
# 0        73        59         5         4        31
# 1        30        85        38        32        80
# 2        80        71        59        71        53
# 3        16        63        86        81        42
# 4        19        40        48         5        68

How to Select Columns Meeting a Condition

Similarly, we can select columns where the values meet a condition. We can also do this by using a list comprehension. In the comprehension, we’ll write a condition to evaluate against.

Say we wanted to filter down to only columns where any value is equal to 30. In this case, we could write the following:

# Selecting Columns Where Values Meet a Condition
print(df.loc[ : , [(df[col] == 30).any() for col in df.columns]])

# Returns:
#    Score  Random_A
# 0     30        73
# 1     32        30
# 2     34        80
# 3     38        16
# 4     29        19

Let’s break down what we did here:

  1. We used to .iloc accessor to access all rows
  2. We then used a list comprehension to select column names meeting a condition
  3. Our condition returns any column where any of the values are equal to 30 in that column

Copying Columns vs. Selecting Columns

Something important to note for all the methods covered above, it might looks like fresh dataframes were created for each. However, that’s not the case!

In Python, the equal sign (“=”), creates a reference to that object.

Because of this, you’ll run into issues when trying to modify a copied dataframe.

In order to avoid this, you’ll want to use the .copy() method to create a brand new object, that isn’t just a reference to the original.

To accomplish this, simply append .copy() to the end of your assignment to create the new dataframe.

For example, if we wanted to create a filtered dataframe of our original that only includes the first four columns, we could write:

new_df = df.iloc[:,:4].copy()
print(new_df)

This results in this code below:

      Name  Age Height  Score
0      Joe   28    5'9     30
1  Melissa   26    5'5     32
2      Nik   31   5'11     34
3   Andrea   33    5'6     38
4     Jane   32    5'8     29

This is incredibly helpful if you want to work the only a smaller subset of a dataframe.

Conclusion: Using Pandas to Select Columns

In this tutorial, you learned how to use Pandas to select columns. You learned how to use many different methods to select columns, including using square brackets to select a single or multiple columns. You then learned many different ways to use the .loc and .iloc accessors to select columns.

You learned some unique ways of selecting columns, such as when column names contain a string and when a column contains a particular value.

Additional Resources

To learn more about related topics, check out the tutorials below:

Leave a Reply

Your email address will not be published.