
PyTorch AutoGrad: Automatic Differentiation for Deep Learning


In this guide, you’ll learn about the PyTorch autograd engine, which allows your model to compute gradients. In deep learning, a fundamental algorithm is backpropagation, which allows your model to adjust its parameters according to the gradient of the loss function with respect to the given parameter.

Because of how important backpropagation is in deep learning, having a good understanding of the mechanics in PyTorch can make you a strong deep learning practitioner. By the end of this guide, you’ll have learned the following:

  • How PyTorch tracks gradients in its model parameters
  • How to access autograd values, such as a parameter’s gradient and weights
  • How to disable gradient tracking temporarily to update a model’s parameters
  • How autograd fits into the broader deep learning workflow

Tracking Gradients in PyTorch

To demonstrate tracking gradients in PyTorch, let’s take a look at a manual one-layer network. The network has an input array, X, and an output, y. Similarly, we’ll have parameters W and b, which represent the weight and bias of the function we hope to model.

PyTorch allows us to make the explicit choice when we create tensors as to whether or not we want to track gradients.

We’ll first define our input and output variables. For these, we don’t want to track gradients since they will remain constant.

# Import our Libraries and Create Data
import torch

X = torch.tensor([[1.0], [2.0], [3.0], [4.0]], dtype=torch.float32)
y = torch.tensor([[3.0], [5.0], [7.0], [9.0]], dtype=torch.float32)

We can see that this data follows a simple linear relationship (albeit with very few data points): each output is given exactly by y = 2x + 1.

Let’s now create our model’s parameters. For these, we’ll want to track gradients, which will allow us to update the parameters as we train our network. To keep things simple, we’ll define these with 0 values, while in practice they’d be randomly initialized.

# Create a Weight and Bias Tensor
W = torch.tensor([[0.0]], requires_grad=True, dtype=torch.float32)
b = torch.tensor([[0.0]], requires_grad=True, dtype=torch.float32)

We can see that in the code block above, we added the requires_grad=True argument, which instructs PyTorch to track the operations performed on these tensors so that their gradients can be calculated later.
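As a quick illustration of this flag, the minimal sketch below inspects the requires_grad attribute directly. The tensor names data, param, and late are made up for this example and are not part of the model in this guide.

```python
import torch

# A plain data tensor: gradient tracking is off by default
data = torch.tensor([[1.0], [2.0]])
print(data.requires_grad)   # False

# A parameter tensor created with tracking switched on
param = torch.tensor([[0.0]], requires_grad=True)
print(param.requires_grad)  # True

# Tracking can also be enabled in place after a tensor is created
late = torch.zeros(1, 1)
late.requires_grad_()
print(late.requires_grad)   # True

# Any result computed from a tracked tensor is tracked as well
out = data.mm(param)
print(out.requires_grad)    # True
```

The last check is the important one: tracking propagates from the parameters to everything computed from them, which is what later lets the loss be traced back to W and b.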

Running a Forward Pass to Get Gradients

Let’s now run a single forward pass through our model. This will allow us to predict some values, which will further allow us to calculate our loss (how far off our predictions are). Once we have the loss, we can calculate its gradient with respect to each parameter and propagate this back through the network.

Let’s take a look at what our forward pass looks like, by first defining a forward function:

# Run a Single Forward Pass
def forward(X, W, b):
    return X.mm(W) + b

y_pred = forward(X, W, b)
print(y_pred)

# Returns:
# tensor([[0.],
#         [0.],
#         [0.],
#         [0.]], grad_fn=<AddBackward0>)

We can see in the code block above that we defined a simple forward() function. The function assumes a linear model and returns the result of multiplying our input by our weight tensor and adding our bias.

When we make our first predictions, we return a tensor of 0s. Note, however, that the tensor also carries a gradient function (grad_fn)! It records the operation that produced the tensor, which allows us to later calculate the gradient of the loss.

In order to calculate the loss, let’s use the mean squared error, which will help amplify the effects of larger errors.

# Calculating the Loss
loss = torch.mean((y_pred - y) ** 2)
print(loss)

# Returns:
# tensor(41., grad_fn=<MeanBackward0>)

By calculating the mean squared error, we can see that we return the value of 41, but also another gradient function.

Let’s take a look at these gradient functions a little closer:

# Printing the Gradient Functions
print(f'Gradient Function for Prediction: {y_pred.grad_fn}')
print(f'Gradient Function for Loss: {loss.grad_fn}')

# Returns:
# Gradient Function for Prediction: <AddBackward0 object at 0x7f7e2864d910>
# Gradient Function for Loss: <MeanBackward0 object at 0x7f7e2864d9d0>

These gradient functions record how each tensor was computed. PyTorch uses them to calculate the gradients of the loss, which we need in order to perform gradient descent.
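To see how these gradient functions link together, the sketch below walks backwards from the loss using the next_functions attribute of each node. The walk() helper is a hypothetical name written for this illustration, and the exact node names printed may differ slightly between PyTorch versions.

```python
import torch

X = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
y = torch.tensor([[3.0], [5.0], [7.0], [9.0]])
W = torch.tensor([[0.0]], requires_grad=True)
b = torch.tensor([[0.0]], requires_grad=True)

loss = torch.mean((X.mm(W) + b - y) ** 2)

def walk(node, names):
    """Collect node names depth-first, starting from a grad_fn (hypothetical helper)."""
    if node is None:  # inputs that do not require gradients have no node
        return names
    names.append(node.name())
    for parent, _ in node.next_functions:
        walk(parent, names)
    return names

names = walk(loss.grad_fn, [])
for name in names:
    print(name)
```

The walk starts at the mean operation and ends at AccumulateGrad nodes, which are the points where gradients get written into the .grad attribute of W and b.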

Using PyTorch AutoGrad for Gradient Descent

The loss tensor is an object, so it comes bundled with methods and attributes. One of these methods is the .backward() method, which walks back through the recorded gradient functions and calculates the gradient of the loss with respect to each parameter.

Let’s apply the .backward() method to get gradients for each parameter. We can then access the gradient for each parameter through its .grad attribute:

# Propagating the Loss Backwards
loss.backward()
print(f'Gradient for W: {W.grad}')
print(f'Gradient for b: {b.grad}')

# Returns:
# Gradient for W: tensor([[-35.]])
# Gradient for b: tensor([[-12.]])

In the code block above, we printed out the gradient for each of our model’s parameters. We can use these gradients to determine the direction and degree in which we need to update our model’s parameters.
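These values can be checked by hand. For the mean squared error of a linear model, the gradient with respect to W is 2 times the mean of x * (y_pred - y), and the gradient with respect to b is 2 times the mean of (y_pred - y). The sketch below compares these formulas with what autograd produces; the names residual, manual_dW, and manual_db are introduced only for this check.

```python
import torch

X = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
y = torch.tensor([[3.0], [5.0], [7.0], [9.0]])
W = torch.tensor([[0.0]], requires_grad=True)
b = torch.tensor([[0.0]], requires_grad=True)

# Gradients computed by autograd
loss = torch.mean((X.mm(W) + b - y) ** 2)
loss.backward()

# The same gradients computed from the closed-form MSE derivatives
with torch.no_grad():
    residual = X.mm(W) + b - y             # y_pred - y
    manual_dW = 2 * (X * residual).mean()  # 2 * mean(x * (y_pred - y))
    manual_db = 2 * residual.mean()        # 2 * mean(y_pred - y)

print(manual_dW.item(), W.grad.item())  # -35.0 -35.0
print(manual_db.item(), b.grad.item())  # -12.0 -12.0
```

Both gradients are negative, so subtracting them during gradient descent increases W and b, which is the direction we want given that the target values are 2 and 1.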

We use a learning rate to control the degree to which parameters are adjusted. It determines the step size at each iteration when updating the model’s parameters during training.

When training a machine learning model, the goal is to minimize the loss function, which represents the discrepancy between the model’s predictions and the actual target values. The optimization process involves finding the values of the model’s parameters that lead to the smallest possible loss.

Let’s see how we can define a learning rate and then update our parameters:

# Update parameters manually using gradients
learning_rate = 0.001

with torch.no_grad():
    W -= learning_rate * W.grad
    b -= learning_rate * b.grad

In the example above, we define a learning rate of 0.001 and update our parameters inside the torch.no_grad() context manager. The context manager prevents PyTorch from tracking the operations that happen inside the block. A manual parameter update is not part of the model’s computation, so we do not want it recorded in the computation graph used for backpropagation.

We use the augmented assignment operator to subtract our learning rate multiplied by the gradient.
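To illustrate why the context manager is needed, the sketch below tries the same in-place update with and without torch.no_grad(). The variable names error_raised and doubled exist only for this demonstration.

```python
import torch

W = torch.tensor([[0.0]], requires_grad=True)

# Outside no_grad, an in-place update of a tracked leaf tensor is rejected
try:
    W -= 0.1
    error_raised = False
except RuntimeError as err:
    error_raised = True
    print(f'RuntimeError: {err}')

# Inside no_grad, the same update goes through and is not recorded
with torch.no_grad():
    W -= 0.1
    doubled = W * 2

print(W.requires_grad)        # True: W itself is still a tracked parameter
print(doubled.requires_grad)  # False: computed while tracking was disabled
```

Note that torch.no_grad() does not change W itself. The parameter keeps requires_grad=True, so the next forward pass outside the block is tracked as usual.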

We can then check the current values of our parameters by detaching them and printing the value, as shown below:

# Printing Our Learned Weights and Bias
print(f'Current value of W: {W.detach().item()}')
print(f'Current value of b: {b.detach().item()}')

# Returns:
# Current value of W: 0.03500000014901161
# Current value of b: 0.012000000104308128

We can see that our weight and bias are being gently nudged in the right direction. Additionally, we can see that the weight moved further than the bias in this step, because its gradient is larger in magnitude (-35 versus -12).

The next step is to reset the gradient of each parameter. PyTorch accumulates gradients, adding the result of each new .backward() call to whatever is already stored in .grad, so stale gradients would otherwise leak into subsequent training steps. We can reset them by setting the gradients to None, as shown below:

# Resetting Our Gradients for Next Training
W.grad = None
b.grad = None
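The accumulation behaviour is easy to observe. In the minimal sketch below, compute_loss() is a hypothetical helper that reruns the forward pass, and the loss is propagated backwards twice before the gradients are reset.

```python
import torch

X = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
y = torch.tensor([[3.0], [5.0], [7.0], [9.0]])
W = torch.tensor([[0.0]], requires_grad=True)
b = torch.tensor([[0.0]], requires_grad=True)

def compute_loss():
    """Rerun the forward pass and return the MSE loss (hypothetical helper)."""
    return torch.mean((X.mm(W) + b - y) ** 2)

# First backward pass
compute_loss().backward()
first = W.grad.clone()

# Second backward pass without a reset: the new gradient is added to the old one
compute_loss().backward()
accumulated = W.grad.clone()

# Reset, then run one more backward pass
W.grad = None
b.grad = None
compute_loss().backward()
after_reset = W.grad.clone()

print(first.item())        # -35.0
print(accumulated.item())  # -70.0
print(after_reset.item())  # -35.0
```

Without the reset, the second update would use a gradient twice as large as it should be, even though the parameters had not changed.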

Now that we have a strong understanding of how this process works, let’s put it all together and follow some more PyTorch conventions to simplify our process.

PyTorch AutoGrad: Putting It All Together

In this section, you’ll learn how to follow PyTorch conventions to implement autograd in a simple, Pythonic way. Let’s first define some functions to move through our model and calculate a loss:

# Define a Simple Forward Function and Loss Function (MSE)
def forward(X, W, b):
    return X.mm(W) + b

def loss_fn(y_pred, y_true):
    return torch.mean((y_pred - y_true) ** 2)

In the code block below, we’ll also import the torch.optim module and define two additional training settings:

  • num_epochs=200 defines how many times we want the model to see all of our data,
  • print_every=20 defines how often we want to print results from our training

We then also create an optimizer, passing in our parameters and learning rate. The optimizer updates the parameters for us, so we no longer need to do it manually. This becomes more and more important as the number of parameters in a model grows.

Let’s see what this looks like:

import torch
import torch.optim as optim

learning_rate = 0.001
num_epochs = 200
print_every = 20

X = torch.tensor([[1.0], [2.0], [3.0], [4.0]], dtype=torch.float32)
y = torch.tensor([[3.0], [5.0], [7.0], [9.0]], dtype=torch.float32)

W = torch.tensor([[0.0]], requires_grad=True, dtype=torch.float32)
b = torch.tensor([[0.0]], requires_grad=True, dtype=torch.float32)

optimizer = optim.SGD(params=[W, b], lr=learning_rate)

for epoch in range(1, num_epochs + 1):
    # Reset Gradients for Each Epoch (optimizer.zero_grad() does the same job)
    W.grad = None
    b.grad = None

    # Run a Forward Pass
    y_pred = forward(X, W, b)

    # Calculate the Loss
    loss = loss_fn(y_pred, y)

    # Propagate the Loss Backwards
    loss.backward()

    # Update Parameters
    optimizer.step()
    # Same as:
    # with torch.no_grad():
    #     W -= learning_rate * W.grad
    #     b -= learning_rate * b.grad

    if epoch % print_every == 0:
        print(f'Epoch: {epoch:02} - MSE Loss: {loss:.2f}')

# Returns:
# Epoch: 20 - MSE Loss: 21.63
# Epoch: 40 - MSE Loss: 11.03
# Epoch: 60 - MSE Loss: 5.63
# Epoch: 80 - MSE Loss: 2.88
# Epoch: 100 - MSE Loss: 1.47
# Epoch: 120 - MSE Loss: 0.76
# Epoch: 140 - MSE Loss: 0.39
# Epoch: 160 - MSE Loss: 0.21
# Epoch: 180 - MSE Loss: 0.11
# Epoch: 200 - MSE Loss: 0.06

In the code block above, we use a training loop to develop our PyTorch model. You’ll notice that we use optimizer.step() to replace updating our model’s values manually.
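To check that optimizer.step() really does match the manual update for plain SGD, the sketch below runs one update each way from the same starting point. The helpers make_params() and backward_pass() are hypothetical names for this comparison, and the equivalence assumes SGD without momentum or weight decay, as used in this guide.

```python
import torch
import torch.optim as optim

learning_rate = 0.001
X = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
y = torch.tensor([[3.0], [5.0], [7.0], [9.0]])

def make_params():
    """Create a fresh weight and bias, both starting at zero."""
    W = torch.tensor([[0.0]], requires_grad=True)
    b = torch.tensor([[0.0]], requires_grad=True)
    return W, b

def backward_pass(W, b):
    """Run a forward pass and populate W.grad and b.grad."""
    loss = torch.mean((X.mm(W) + b - y) ** 2)
    loss.backward()

# Version 1: let the optimizer apply the update
W_opt, b_opt = make_params()
optimizer = optim.SGD([W_opt, b_opt], lr=learning_rate)
optimizer.zero_grad()
backward_pass(W_opt, b_opt)
optimizer.step()

# Version 2: apply the same update by hand
W_man, b_man = make_params()
backward_pass(W_man, b_man)
with torch.no_grad():
    W_man -= learning_rate * W_man.grad
    b_man -= learning_rate * b_man.grad

print(torch.allclose(W_opt, W_man), torch.allclose(b_opt, b_man))  # True True
```

The optimizer handles the torch.no_grad() bookkeeping internally, which is why no context manager appears in the first version.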

After our 200 epochs, our weight and bias should be approaching the values we expect (2 and 1). Note that training for more epochs would bring them closer still. Let’s see what our values look like now:

# Printing Learned Weights and Biases
print(f'Current value of W: {W.detach().item()}')
print(f'Current value of b: {b.detach().item()}')

# Returns:
# Current value of W: 2.0195605754852295
# Current value of b: 0.7054842114448547

We can see that, even with a small amount of data, PyTorch was able to learn quite a bit about our data!
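As a rough check that the parameters keep moving towards 2 and 1 with more training, the sketch below reruns the same loop with a larger learning rate and more epochs. Both values (0.01 and 5000) are assumptions chosen for this illustration rather than settings from the guide above.

```python
import torch
import torch.optim as optim

X = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
y = torch.tensor([[3.0], [5.0], [7.0], [9.0]])
W = torch.tensor([[0.0]], requires_grad=True)
b = torch.tensor([[0.0]], requires_grad=True)

# Assumed settings for this illustration: lr=0.01 and 5000 epochs
optimizer = optim.SGD([W, b], lr=0.01)

for epoch in range(5000):
    optimizer.zero_grad()                      # reset gradients
    loss = torch.mean((X.mm(W) + b - y) ** 2)  # forward pass and MSE loss
    loss.backward()                            # compute gradients
    optimizer.step()                           # update W and b

print(f'W: {W.item():.3f}, b: {b.item():.3f}')
```

With these settings, both parameters should land very close to the values that generated the data, W = 2 and b = 1.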

Conclusion

In conclusion, this guide has provided a comprehensive overview of the PyTorch autograd engine and its significance in deep learning. Backpropagation, a fundamental algorithm in deep learning, enables models to adjust their parameters based on the gradients of the loss function. Understanding the mechanics of PyTorch’s autograd can empower you as a strong deep learning practitioner.

Throughout the guide, you’ve learned the following key concepts:

  1. How PyTorch tracks gradients in model parameters, allowing for automatic differentiation during backpropagation.
  2. How to access autograd values, such as parameter gradients and weights.
  3. How to temporarily disable gradient tracking to update model parameters.
  4. How autograd fits into the broader deep learning workflow, facilitating optimization and training of neural networks.

The guide demonstrated the process of tracking gradients in a simple one-layer network, allowing you to calculate the loss and perform gradient descent for parameter updates. As the training process progressed, you observed how the weights and bias values were adjusted to minimize the loss function, effectively improving the model’s predictions.

Nik Piepenbreier

Nik is the author of datagy.io and has over a decade of experience working with data analytics, data science, and Python. He specializes in teaching developers how to use Python for data science using hands-on tutorials.
