Wednesday, November 27, 2019

Notes on Root Mean Squared Error and R^2

Linear Regression Metrics

Two popular linear regression metrics are the root mean squared error and R^2.

Say you have a data set and have plotted the line of best fit. How well does the line of best fit actually model the data? Visually, if the data points sit very close to the line, you can say it is a pretty good model; if the points are scattered everywhere and not particularly hugging the line, it is not. But there is a more rigorous way of telling whether your linear model is good or not.

SSE - Sum of Squared Errors

The most straightforward way to see whether your model is a good fit compared to other models is to find the sum of the squared errors: sum up the squared differences between the actual values and the predicted values. The lower the sum, the better the model.
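As a minimal sketch in Python with NumPy (the arrays y_true and y_pred are made-up stand-ins for the actual values and the predictions from the line of best fit):

import numpy as np

y_true = np.array([3.0, 5.0, 7.5, 10.0])  # actual values (made-up example data)
y_pred = np.array([2.8, 5.3, 7.0, 10.4])  # values predicted by the line of best fit

sse = np.sum((y_true - y_pred) ** 2)  # sum of the squared errors
print(sse)  # ≈ 0.54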

MAE - Mean Absolute Error

Another way to compare models is to find the mean absolute error, which is just the average of the absolute differences between the actual values and the predicted values. Again, the lower the mean, the better the model. SSE is more sensitive to outliers than MAE (since we are squaring the errors). On the other hand, MAE is interpretable as-is, because it is in the same units as the original values. To make SSE comparably interpretable, you would divide it by the total number of values and then take the square root. Which leads us to...
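Continuing with the same y_true and y_pred arrays from the sketch above, MAE is one line:

mae = np.mean(np.abs(y_true - y_pred))  # average absolute error, in the original units
print(mae)  # ≈ 0.35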

Root Mean Squared Error

The root mean squared error is simply the square root of the SSE divided by the total number of values. This gives a more interpretable metric, in the original units, for how good our model is. Again, RMSE, like SSE, is more sensitive to outliers than MAE and penalizes them more heavily.
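Again continuing the sketch, and reusing the sse computed earlier:

rmse = np.sqrt(sse / len(y_true))  # square root of SSE divided by the number of values
# equivalently: np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse)  # ≈ 0.367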

R^2

R^2, or the coefficient of determination, is another metric you could use to see how good your model is. R^2 is usually between 0 and 1, but there are cases where R^2 can be negative. (That's really 'bad'.) The idea behind R^2 is to measure the target "variance explained" by the model. (How much variance of the target can be explained by the variance of the inputs?)

SST, or the total sum of squares, is the target's intrinsic variance from its mean. It is given by the sum of the squared differences between the actual values and the mean of the actual values.

SSE/SST gives us the fraction of the variance that remains once you've modeled the data; that is, the ratio of the sum of squared errors to the total sum of squares is the portion of the variance that the model cannot explain. Subtracting SSE/SST from 1 gives the portion of the variance that the model can explain.

1 - SSE/SST is the coefficient of determination, or R^2. "R^2 measures how good [your] regression fit is relative to a constant model that predicts the mean".

(Usually SSE is smaller than SST, assuming our model does at least a little better than just predicting the mean all the time, so R^2 falls between 0 and 1.)
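Putting it together with the sketches above, R^2 on the same made-up data:

sst = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
r2 = 1 - sse / sst  # coefficient of determination
print(r2)  # ≈ 0.98, close to 1, so the model explains most of the variance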
