Welcome to our exploration of R-squared (R2), a powerful metric in statistics that assesses the goodness of fit in regression models. R2 represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s). In this post, we’ll guide you through the essentials of R2 and demonstrate how to calculate it using popular Python libraries such as scikit-learn (sklearn) and SciPy.
By the end of this post, you’ll have learned the following:
- What R2 signifies in the context of regression analysis
- How to calculate the R2 value in Scikit-Learn
- How to calculate the R2 value in SciPy
Understanding the R-Squared (Coefficient of Determination) in Regression Analysis
In regression analysis, the R-squared (R2) value plays an important role in evaluating the performance of a model. The R-squared metric quantifies the proportion of the variance in the dependent variable that can be explained by the independent variable(s).
For an ordinary least-squares model with an intercept, the value of R2 ranges from 0 to 1, where a higher value indicates a better fit. A better fit, in this case, means that a larger percentage of the variability in the dependent variable is captured by the model.
Let’s take a look at a visual example. Consider the graph below, where we have plotted our variables x and y using a scatterplot. We have then added a line of best fit, which we calculated using linear regression methods in Scikit-Learn.
In the image above, there are a few things to take note of:
- Our x and y variables seem to form a linear relationship, which is plotted with a red line of best fit
- The grey dashed lines show how far each data point is away from the line of best fit
We can see that while the line explains the general trend of the data, it’s not perfect. We can also see that the R2 value is included in the title. Because our value is 0.79, we can understand that 79% of the variance in the dependent variable is accounted for by the independent variable. This means that 21% of the variance is still unexplained!
While a high R-squared value is generally desirable, it doesn’t tell us whether a model has shortcomings. For example, it doesn’t inform us about the correctness of the model’s specification or the reliability of its predictions. Instead, the value focuses only on the proportion of variance explained.
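The proportion-of-variance idea can be written as a formula: R2 = 1 − SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the total sum of squares around the mean. Here’s a minimal sketch of that calculation using NumPy and some hypothetical observed and predicted values:

```python
# A minimal sketch of the R-squared formula: R2 = 1 - SS_res / SS_tot
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])   # hypothetical observed values
y_pred = np.array([2.8, 5.3, 6.9, 9.4, 10.6])   # hypothetical model predictions

ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained (residual) variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r_squared = 1 - ss_res / ss_tot
print(f'R-squared: {r_squared:.4f}')
# Returns: R-squared: 0.9885
```

Because the predictions track the observations closely, almost all of the variation around the mean is explained, and R2 is close to 1.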
Now that you have a good understanding of what the r-squared represents, let’s dive into how to calculate it using sklearn, the popular machine learning library.
How to Calculate R-Squared in Scikit-Learn
Scikit-Learn, the popular machine learning library, makes it easy to work with linear regression models and to calculate the r-squared value. To do this, we first create some fake data, create a LinearRegression model, and use it to generate our line of best fit.
```python
# Creating a Linear Model with Scikit-Learn
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Generate synthetic data
np.random.seed(42)
X = 2 * np.random.rand(25, 1)
y = 4 + 3 * X + np.random.randn(25, 1)

# Fit a linear regression model
model = LinearRegression()
model.fit(X, y)

# Make predictions
y_pred = model.predict(X)
```
Since there’s quite a lot going on in the code block above, let’s break it down:
- We first import our libraries and generate random data using a random seed. We create 25 x-values and 25 corresponding y-values.
- We then instantiate a LinearRegression class and fit our data to it. This generates the line of best fit and other metrics associated with it.
- We also use the model to make some predictions, which we save into the y_pred variable.
Now that we have our predictions, we can calculate the r-squared value using the r2_score() function from scikit-learn. Let’s see how this works:
```python
# How to Calculate the R-Squared Value in Scikit-Learn
r2 = r2_score(y, y_pred)
print(r2)
# Returns: 0.7918569668880392
```
In the code block above, we passed our original y-values and our predicted y-values into the r2_score() function, which returns the r-squared value.
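As a shortcut, a fitted LinearRegression model also exposes a score() method, which computes the predictions internally and returns the R-squared value directly. Here’s a sketch that reuses the same synthetic data as above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Recreate the synthetic data from the earlier example
np.random.seed(42)
X = 2 * np.random.rand(25, 1)
y = 4 + 3 * X + np.random.randn(25, 1)

model = LinearRegression().fit(X, y)

# score() predicts internally and returns the R-squared value
print(model.score(X, y))
# Returns: 0.7918569668880392
```

This matches the value from r2_score(), since the default scorer for regression models in scikit-learn is R2.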
Now that you know how to calculate the r-squared in scikit-learn, let’s explore how you can use SciPy to calculate the important metric.
How to Calculate R-Squared in SciPy
In this section, we’ll explore how to calculate the R-squared value using the SciPy library. This approach is a bit simpler if you don’t need a full model to work with.
Using the linregress() function from the SciPy library, we can easily calculate important metrics, such as:
- The slope,
- The y-intercept or bias,
- The r-value,
- The p-value, and
- The standard error
Once we have the r-value returned, we can calculate r-squared by simply squaring this value. Let’s take a look at what this looks like:
```python
# How to Calculate R-Squared Using SciPy
import numpy as np
from scipy.stats import linregress

# Generate fake data
np.random.seed(42)
X = 2 * np.random.rand(25, 1)
y = 4 + 3 * X + np.random.randn(25, 1)

# Fit a linear regression model using linregress
slope, intercept, r_value, p_value, std_err = linregress(X.flatten(), y.flatten())

# Calculate R-squared
r_squared = r_value**2
print(f'R-squared: {r_squared:.4f}')
# Returns: R-squared: 0.7919
```
In the code block above, we first generated our fake data using the same process as before. We then passed the flattened (one-dimensional) versions of our data into the linregress() function. Note that if your data is already one-dimensional, you don’t need to flatten it first.
The function returned a tuple with five values in it, which we neatly unpacked. Finally, to calculate the r-squared value, we squared the r_value.
Similar to before, the coefficient of determination that is returned is equal to 0.79.
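As a sanity check (this cross-check isn’t part of the original walkthrough), we can confirm that the two approaches agree on the same data by rebuilding the predictions from the fitted slope and intercept and comparing the squared r-value against scikit-learn’s r2_score():

```python
import numpy as np
from scipy.stats import linregress
from sklearn.metrics import r2_score

# Same synthetic data as in the earlier examples
np.random.seed(42)
X = 2 * np.random.rand(25, 1)
y = 4 + 3 * X + np.random.randn(25, 1)

result = linregress(X.flatten(), y.flatten())

# Rebuild the predictions from the fitted slope and intercept
y_pred = result.intercept + result.slope * X.flatten()

# For simple linear regression, r**2 equals the coefficient of determination
print(np.isclose(result.rvalue ** 2, r2_score(y.flatten(), y_pred)))
# Returns: True
```

This equality holds for simple linear regression fitted by least squares, which is exactly what linregress() performs.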
Conclusion
In conclusion, our exploration of the R-squared (R2) metric has provided valuable insights into its significance in assessing the goodness of fit in regression models. As a measure of the proportion of variance in the dependent variable predictable from the independent variable(s), R2 serves as a crucial tool for model evaluation.
Throughout this post, we’ve covered fundamental concepts such as the interpretation of R2 in regression analysis and the step-by-step calculation of the R-squared value. By demonstrating practical implementations using Python, we’ve showcased how to calculate R2 using popular libraries like scikit-learn and SciPy.
Understanding R-squared involves recognizing that a higher value indicates a better fit, representing the percentage of variability captured by the model. However, we’ve also emphasized that a high R-squared does not guarantee a flawless model, as it doesn’t address the correctness of model specifications or the reliability of predictions.
Armed with this knowledge, you are now well-equipped to assess and communicate the goodness of fit in your regression models, leveraging scikit-learn and SciPy for efficient R-squared calculations. Explore further, experiment with diverse datasets, and enhance your proficiency in model evaluation with these valuable tools. Happy coding!
To learn more about how to use scikit-learn to calculate the r-squared value, check out the official documentation.