
How to Calculate R-Squared in Python (SkLearn and SciPy)


Welcome to our exploration of R-squared (R2), a powerful metric in statistics that assesses the goodness of fit in regression models. R2 represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s). In this post, we’ll guide you through the essentials of R2 and demonstrate how to calculate it using popular Python libraries such as scikit-learn (sklearn) and SciPy.

By the end of this post, you’ll have learned the following:

  • What R2 signifies in the context of regression analysis
  • How to calculate the R2 value in Scikit-Learn
  • How to calculate the R2 value in SciPy

Understanding the R-Squared (Coefficient of Determination) in Regression Analysis

In regression analysis, the R-squared (R2) value plays an important role in evaluating the performance of a model. The R-squared metric quantifies the proportion of the variance in the dependent variable that can be explained by the independent variable(s).

The value of R2 will range from 0 to 1, where a higher value indicates a better fit. A better fit, in this case, implies that a larger percentage of the variability in the dependent variable is captured by the model.
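In formula terms, R2 is one minus the ratio of the residual sum of squares to the total sum of squares. Here's a minimal NumPy sketch of that calculation, using small made-up arrays of observed and predicted values (illustrative numbers only, not data from this post):

```python
import numpy as np

# Made-up observed values and model predictions (illustrative only)
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4, 10.6])

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
r_squared = 1 - ss_res / ss_tot

print(round(r_squared, 4))

# Returns: 0.9885
```

Every library function we use later in this post is, under the hood, computing this same ratio.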

Let’s take a look at a visual example. Consider the graph below, where we have plotted our variables x and y using a scatterplot. We have then added a line of best fit, which we calculated using linear regression methods in Scikit-Learn.

Plotting a Line of Best Fit to Demonstrate R-Squared in Python

In the image above, there are a few things to take note of:

  • Our x and y variables seem to form a linear relationship, which is plotted with a red line of best fit
  • The grey dashed lines show how far each data point is away from the line of best fit
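A figure like the one described above can be reproduced with a short matplotlib sketch. The data below is synthetic and the styling choices (seed, colors) are our own, not necessarily those used for the original figure:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data (illustrative; not the article's exact dataset)
rng = np.random.default_rng(42)
X = 2 * rng.random((25, 1))
y = 4 + 3 * X + rng.standard_normal((25, 1))

# Fit a line of best fit
model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

fig, ax = plt.subplots()
ax.scatter(X, y, label='Data')
ax.plot(X, y_pred, color='red', label='Line of best fit')

# Grey dashed lines from each point to the fitted line (the residuals)
for xi, yi, ypi in zip(X.ravel(), y.ravel(), y_pred.ravel()):
    ax.plot([xi, xi], [yi, ypi], color='grey', linestyle='--', linewidth=0.8)

ax.set_title(f'R-squared = {r2_score(y, y_pred):.2f}')
ax.legend()
plt.show()
```

The dashed residual lines make it easy to see, point by point, how much variation the fitted line leaves unexplained.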

We can see that while the line explains the general trend of the data, it’s not perfect. We can also see that the R2 value is included in the title. Because our value is 0.79, we can understand that 79% of the variance in the dependent variable is accounted for by the independent variable. This means that 21% of the variance is still unexplained!

While a high R-squared value is generally desirable, it doesn’t, on its own, tell you whether a model has shortcomings. For example, it doesn’t inform us about the correctness of the model’s specification or the reliability of its predictions. Instead, the value focuses only on the proportion of variance explained.
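As a quick illustration of this caveat (a toy example of our own, not from the figure above): fitting a straight line to data that is actually quadratic can still produce a high R-squared, even though the model is clearly misspecified:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Quadratic data -- a straight line is the wrong model here
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = X.ravel() ** 2

# Fit a (misspecified) linear model and score it
model = LinearRegression().fit(X, y)
r2 = r2_score(y, model.predict(X))
print(f'{r2:.2f}')
```

The printed value comes out well above 0.9, even though plotting the residuals would immediately reveal the curvature the line misses. This is why R-squared should be paired with other diagnostics, such as residual plots.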

Now that you have a good understanding of what the r-squared represents, let’s dive into how to calculate it using sklearn, the popular machine learning library.

How to Calculate R-Squared in Scikit-Learn

Scikit-Learn makes it easy to work with linear regression models and to calculate the r-squared value. To do this, we first need to create a LinearRegression model, which will allow us to generate a line of best fit.

Let’s start by creating some fake data, building a Linear Regression model, and using it to generate our line of best fit.

# Creating a Linear Model with Scikit-Learn
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Generate synthetic data
np.random.seed(42)
X = 2 * np.random.rand(25, 1)  
y = 4 + 3 * X + np.random.randn(25, 1)

# Fit a linear regression model
model = LinearRegression()
model.fit(X, y)

# Make predictions
y_pred = model.predict(X)

Since there’s quite a lot going on in the code block above, let’s break it down:

  1. We first import our libraries and generate random data using a random seed, creating 25 x values and 25 corresponding y values.
  2. We then instantiate a LinearRegression class and fit our data to it. This generates the line of best fit and the metrics associated with it.
  3. Finally, we use the model to make predictions, which we save in the y_pred variable.

Now that we have our predictions, we can calculate the r-squared value using the r2_score function from scikit-learn. Let’s see how this works:

# How to Calculate the R-Squared Value in Scikit-Learn
r2 = r2_score(y, y_pred)
print(r2)

# Returns: 0.7918569668880392

In the code block above, we passed our original y-values and our predicted y-values into the r2_score() function, which returns the r-squared value.

Now that you know how to calculate the r-squared in scikit-learn, let’s explore how you can use SciPy to calculate the important metric.

How to Calculate R-Squared in SciPy

In this section, we’ll explore how to calculate the R-squared value using the SciPy library. This approach is a bit simpler if you don’t need a full model object to work with.

Using the linregress() function from the SciPy library, we can easily calculate important metrics, such as:

  • The slope,
  • The y-intercept or bias,
  • The r-value,
  • The p-value, and
  • The standard error

Once we have the r-value returned, we can calculate r-squared by simply squaring this value. Let’s take a look at what this looks like:

# How to Calculate R-Squared Using SciPy
import numpy as np
from scipy.stats import linregress

# Generate fake data
np.random.seed(42)
X = 2 * np.random.rand(25, 1) 
y = 4 + 3 * X + np.random.randn(25, 1) 

# Fit a linear regression model using linregress
slope, intercept, r_value, p_value, std_err = linregress(X.flatten(), y.flatten())

# Calculate R-squared
r_squared = r_value**2
print(f'R-squared: {r_squared:.4f}')

# Returns: R-squared: 0.7919

In the code block above, we first generated our fake data using the same process as before. We then passed the flattened (one-dimensional) versions of our data into the linregress() function. Note that if your data is already one-dimensional, you don’t need to flatten it first.

The function returned a tuple with five values in it, which we neatly unpacked. Finally, to calculate the r-squared value, we square the r_value.

Similar to before, the coefficient of determination that is returned is equal to 0.79.
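As a sanity check, you can confirm that squaring the r-value from linregress() gives the same number as scikit-learn’s r2_score() for a simple linear fit. Here’s a quick sketch using the same synthetic data as the earlier examples:

```python
import numpy as np
from scipy.stats import linregress
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Same synthetic data as in the examples above
np.random.seed(42)
X = 2 * np.random.rand(25, 1)
y = 4 + 3 * X + np.random.randn(25, 1)

# scikit-learn route: fit, predict, score
model = LinearRegression().fit(X, y)
r2_sklearn = r2_score(y, model.predict(X))

# SciPy route: square the returned r-value
result = linregress(X.flatten(), y.flatten())
r2_scipy = result.rvalue ** 2

print(np.isclose(r2_sklearn, r2_scipy))

# Returns: True
```

For simple (one-variable) linear regression the two routes are mathematically equivalent, so which one you use comes down to whether you need a fitted model object or just the summary statistics.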

Conclusion

In conclusion, our exploration of the R-squared (R2) metric has provided valuable insights into its significance in assessing the goodness of fit in regression models. As a measure of the proportion of variance in the dependent variable predictable from the independent variable(s), R2 serves as a crucial tool for model evaluation.

Throughout this post, we’ve covered fundamental concepts such as the interpretation of R2 in regression analysis and the step-by-step calculation of the R-squared value. By demonstrating practical implementations using Python, we’ve showcased how to calculate R2 using popular libraries like scikit-learn and SciPy.

Understanding R-squared involves recognizing that a higher value indicates a better fit, representing the percentage of variability captured by the model. However, we’ve also emphasized that a high R-squared does not guarantee a flawless model, as it doesn’t address the correctness of model specifications or the reliability of predictions.

Armed with this knowledge, you are now well-equipped to assess and communicate the goodness of fit in your regression models, leveraging the capabilities of scikit-learn and SciPy for efficient R-squared calculations. Explore further, experiment with diverse datasets, and enhance your proficiency in model evaluation with these valuable tools. Happy coding!

To learn more about how to use scikit-learn to calculate the r-squared value, check out the official documentation.

Nik Piepenbreier

Nik is the author of datagy.io and has over a decade of experience working with data analytics, data science, and Python. He specializes in teaching developers how to use Python for data science using hands-on tutorials.
