Tuesday, June 18, 2019

Linear Regression on four data sets

Another question in Chapter 6 of Christian Hill's book was about linear regression over four data sets.

Using NumPy, it was easy to find the mean and variance of both x and y for each data set. They turned out to be the same across the board: the mean of x is 9, the mean of y is 7.5, the variance of x is 10, and the variance of y is 3.75.

The correlation coefficient is also the same across the board at 0.816.

The linear regression line was also the same across the board at approximately y = 0.5x + 3.
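
As a rough sketch of how these numbers can be reproduced with NumPy: the post does not list the data, so the values below are an assumption. They are Anscombe's quartet, which matches the statistics quoted above. The helper name `summarize` is made up for illustration.

```python
import numpy as np

# Assumed data: Anscombe's quartet (the post itself does not list the values).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "x1, y1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "x2, y2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "x3, y3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "x4, y4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
               [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the summary statistics discussed in the post for one data set."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)  # least-squares fit of a degree-1 polynomial
    return {
        "mean_x": np.mean(x),
        "mean_y": np.mean(y),
        "var_x": np.var(x),  # population variance (ddof=0), NumPy's default
        "var_y": np.var(y),
        "r": np.corrcoef(x, y)[0, 1],  # off-diagonal entry of the 2x2 correlation matrix
        "slope": slope,
        "intercept": intercept,
    }


summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, stats in summaries.items():
    print(name, {key: round(float(value), 3) for key, value in stats.items()})
```

Note that `np.var` divides by the number of elements by default. Passing `ddof=1` gives the sample variance instead, which would be 11 for x rather than 10.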

However, once the data sets were plotted, we can see that they are in fact very different, even though all of them have the same mean, variance, correlation coefficient, and line of best fit.

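The x1, y1 data set looks like this graphically:
[Figure: scatter plot of the x1, y1 data set]

The x2, y2 data set looks like this:
[Figure: scatter plot of the x2, y2 data set]

The x3, y3 data set looks like this:
[Figure: scatter plot of the x3, y3 data set]

The x4, y4 data set looks like this:
[Figure: scatter plot of the x4, y4 data set]
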
As you can see, the scatter plots are very different, so how come the data sets have so many identical statistical attributes?

The mean

The mean of a data set is the average or, mathematically, the sum of all the elements divided by the number of elements. A data set that is very spread out can have the same mean as a data set that is clustered together. For example, the data set with elements {0, 5, 10} has mean 5 and the data set with elements {5, 5, 5} also has mean 5, but they are very different data sets, since the range of the first one is 10 while the range of the latter is 0.
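
The example above can be checked with a few lines of plain Python. This is only a sketch, and the helper names `mean` and `value_range` are made up for illustration.

```python
def mean(values):
    """Sum of all the elements divided by the number of elements."""
    return sum(values) / len(values)


def value_range(values):
    """Distance between the largest and the smallest element."""
    return max(values) - min(values)


spread_out = [0, 5, 10]
clustered = [5, 5, 5]

print(mean(spread_out), mean(clustered))                # 5.0 5.0
print(value_range(spread_out), value_range(clustered))  # 10 0
```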

The variance

The variance of a data set tells us something about its spread. Mathematically, it is the sum of the squared differences between each element and the mean, divided by the number of elements. In other words, the variance is the average squared distance of the elements from the mean.
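
A minimal sketch of that definition in plain Python, assuming the population variance (dividing by the number of elements, as described above). The function name `variance` is only illustrative. Two data sets with the same mean can have very different variances:

```python
def variance(values):
    """Population variance: the average squared distance from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


print(variance([0, 5, 10]))  # (25 + 0 + 25) / 3, roughly 16.67
print(variance([5, 5, 5]))   # 0.0
```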

The correlation coefficient

The correlation coefficient tells us how closely two variables follow a linear relationship. Correlation coefficients range from -1 to 1: -1 means a perfect inverse (negative) linear relationship, 1 means a perfect direct (positive) linear relationship, and 0 means no linear relationship. Mathematically, the correlation coefficient is the sum of the products of each x element's difference from the mean of x and the matching y element's difference from the mean of y, divided by the square root of the product of the sum of the squared x differences and the sum of the squared y differences.

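It may be easier to show you the formula, where \bar{x} and \bar{y} are the means of x and y:

\[ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}} \]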

Intuitively, if the difference between each element of y and the mean of y is always the same positive multiple of the difference between the matching element of x and the mean of x, we get 1 as the correlation coefficient. (The differences from a mean always add up to 0, so they cannot all be the same nonzero number.) For example, if the differences between the x data and the mean of x are {-2, -1, 0, 1, 2} and the differences between the y data and the mean of y are {-4, -2, 0, 2, 4}, we get (-2*-4)+(-1*-2)+(0*0)+(1*2)+(2*4) = 20 as the numerator and square root of [(4+1+0+1+4)(16+4+0+4+16)] = square root of [10(40)] = square root of [400] = 20 as the denominator. This gives the correlation coefficient as 20/20 = 1.

This is saying that if x and y move away from their means in lockstep, our correlation coefficient goes up.

The size of the spread itself does not matter. If we double every y difference, both the numerator and the denominator double, so the correlation coefficient stays the same.

But if the y differences do not line up with the x differences, our correlation coefficient will go down. For example, if the differences between the x data and the mean of x are {-2, -1, 0, 1, 2} and the differences between the y data and the mean of y are {-1, -2, 0, 2, 1}, we get 2+2+0+2+2 = 8 as the numerator and square root of [(4+1+0+1+4)(1+4+0+4+1)] = square root of [(10)(10)] = square root of [100] = 10 as the denominator. Our coefficient becomes 8/10 = 0.8, which is less than 1.

So the correlation coefficient in a sense measures how consistently x and y move away from their means together. (Graphically, it represents the amount of clustering around the line of best fit.)
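
The calculation described in this section can be sketched in plain Python. The function name `pearson_r` and the three small data sets are made up for illustration and are not from the book.

```python
import math


def pearson_r(xs, ys):
    """Correlation coefficient computed directly from the definition."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]  # differences from the mean of x
    dy = [y - mean_y for y in ys]  # differences from the mean of y
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator


xs = [1, 2, 3, 4, 5]
in_step = [2, 4, 6, 8, 10]      # rises exactly in step with xs
out_of_step = [4, 3, 5, 7, 6]   # rises overall, but not in step
opposite = [10, 8, 6, 4, 2]     # falls exactly in step with xs

print(pearson_r(xs, in_step))                # 1.0
print(round(pearson_r(xs, out_of_step), 3))  # 0.8
print(pearson_r(xs, opposite))               # -1.0
```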

Regression Line

The regression line, or line of best fit, is the straight line that best models the data set. It takes the form of any line, y = mx + b, where m, the slope, is equal to the correlation coefficient times the quotient of the standard deviation of y and the standard deviation of x. The y-intercept, b, is calculated by finding the difference between the mean of y and the product of the slope, m, and the mean of x. This makes sense since our y = mx + b equation tells us that b = y - mx. The regression line always passes through the point (mean of x, mean of y), so we can use that point to find the y-intercept.

Why is the slope equal to the correlation coefficient times the quotient of the standard deviation of y and the standard deviation of x? The standard deviation of y over the standard deviation of x is, in a way, the spread of y relative to the spread of x. It is always positive or 0, since the standard deviation is the square root of the variance. We multiply that by the correlation coefficient, which gives us the sign of the slope, or the way x and y are related to each other.
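
Here is a small sketch of the slope and intercept formulas in plain Python. The function name `regression_line` and the five-point data set are made up for illustration.

```python
import math


def regression_line(xs, ys):
    """Return (m, b) using m = r * (std of y / std of x) and b = mean_y - m * mean_x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)  # correlation coefficient
    std_x = math.sqrt(sxx / n)      # standard deviation of x
    std_y = math.sqrt(syy / n)      # standard deviation of y
    m = r * std_y / std_x
    b = mean_y - m * mean_x
    return m, b


xs = [1, 2, 3, 4, 5]
ys = [4, 3, 5, 7, 6]
m, b = regression_line(xs, ys)
print(round(m, 3), round(b, 3))  # 0.8 2.6
```

Algebraically, r * (std of y / std of x) simplifies to `sxy / sxx`, which is the usual least-squares slope, so the two ways of writing it agree.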

Proper use of the Regression line

As we see from the graphs, the same regression line can be fitted to four different scatter plots.

But, other than perhaps for the first graph, a regression line is not suited to represent the scatter plots. Regression lines should only be used if the scatter plot behaves in a somewhat linear manner. In the second plot, the points follow a parabolic curve; the third plot has an outlier; and the fourth plot obviously does not behave in a linear manner.

We can always fit a regression line to any data set, so it must be up to our good judgement to use regression lines when they fit a data set. 

It is best to plot the graph of a data set first to visually see if a regression line would work to model the data. 
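
A rough sketch of that kind of visual check with matplotlib, drawing each scatter plot with its fitted line on top. The data values are an assumption (Anscombe's quartet, which matches the statistics in this post) and this is not necessarily how the linked code does it.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed data: Anscombe's quartet (the post itself does not list the values).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = [
    ("x1, y1", x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    ("x2, y2", x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    ("x3, y3", x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    ("x4, y4", [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
     [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]

fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True, sharey=True)
line_x = np.linspace(3, 20, 50)  # x values used to draw each fitted line

for ax, (title, x, y) in zip(axes.flat, quartet):
    slope, intercept = np.polyfit(x, y, 1)  # least-squares regression line
    ax.scatter(x, y)
    ax.plot(line_x, slope * line_x + intercept, color="red")
    ax.set_title(title)

fig.tight_layout()
# Call plt.show() here to display the figure, or fig.savefig(...) to save it.
```

With the four panels side by side, it is easy to see which data set the line actually describes.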

Final Thoughts

I learned how to plot a regression line on my scatter plot with matplotlib.pyplot. I also learned how to find the correlation coefficient using NumPy. This problem also made me try to intuitively explain what the correlation coefficient and the line of regression are. :D

Github code - Four Data Sets
