Section 2.3 Correlation
You can plot the data points along with the best-fit line determined in the previous section, but the question remains whether the line is any good. In particular, the real use of the line is often to predict y-values for a given x-value. However, it is very likely that the best-fit line does not pass through any of the provided data points. So, how can something that misses every marker still be considered a good fit? To quantify this, we first need a way to measure how well two variables vary with each other.
Determine the best-fit line for the following data:
and then again for
Plot both data sets and plot their corresponding best-fit lines.
- Comment on how well each line seems to describe the data sets.
- Might you be more likely to use the best-fit line to predict an outcome when the first coordinate is 1/2? Why or why not?
- Are you willing to say one line is "better" than the other?
Definition 2.3.2. Sample Covariance.
For paired data \((x_1,y_1),\dots,(x_n,y_n)\) with sample means \(\bar{x}\) and \(\bar{y}\), the sample covariance is
\(s_{xy} = \frac{1}{n-1}\sum_{k=1}^{n}(x_k-\bar{x})(y_k-\bar{y}).\)
This definition provides a measure which is a second-order term (like variance) but also maintains "units" (those of x times those of y). To provide a unit-less metric, consider the following measure.
Definition 2.3.3. Spearman's Rho (Correlation Coefficient).
The correlation coefficient of the paired data is
\(r = \frac{s_{xy}}{s_x s_y},\)
where \(s_x\) and \(s_y\) are the sample standard deviations of the x- and y-values.
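To make these two definitions concrete, here is a minimal plain-Python sketch (not part of the text's Sage cells; the data and function names are illustrative) that computes the sample covariance and the correlation coefficient directly from the defining formulas:

```python
from math import sqrt

def sample_cov(x, y):
    """Sample covariance: sum((x_k - xbar)(y_k - ybar)) / (n - 1)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    return sum((a - xbar) * (c - ybar) for a, c in zip(x, y)) / (n - 1)

def sample_sd(x):
    """Sample standard deviation: the square root of the sample variance."""
    return sqrt(sample_cov(x, x))

def corr(x, y):
    """Correlation coefficient r = s_xy / (s_x * s_y); always in [-1, 1]."""
    return sample_cov(x, y) / (sample_sd(x) * sample_sd(y))

x = [-1.0, 3, 4, 6]
y = [1.0, 1, 3, 4]
print(sample_cov(x, y), corr(x, y))
```

Note that the covariance here carries the units of x times the units of y, while the ratio r is unit-less.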
pretty_print("Here we create the best-fit line and correlation coefficient by first creating the normal equations and solving")
var('u')

# Enter the data points here with (x,y) values paired in like positions in the lists
x = [-1.0, 3, 4, 6]
y = [1.0, 1, 3, 4]
n = len(x)

Points = [(x[k], y[k]) for k in range(n)]  # Create graph of data points
G = point(Points)

# Build the columns of squares and products needed for the normal equations
xsq = []
ysq = []
xy = []
for k in range(n):
    xsq.append(x[k]^2)
    ysq.append(y[k]^2)
    xy.append(x[k]*y[k])

# Put everything in the right place for the normal equations
a11 = sum(xsq)
a12 = sum(x)
a21 = a12
a22 = n
rhs1 = sum(xy)
rhs2 = sum(y)
ysqsum = sum(ysq)
A = matrix([[a11, a12], [a21, a22]])
rhs = matrix([[rhs1], [rhs2]])
show([A, rhs])

# and then solve
coefs = A.inverse()*rhs
m = coefs[0][0]
b = coefs[1][0]
show(['m=', m, ' b=', b])

# While at it, create the standard deviations, covariance, and correlation coefficient.
# Note the covariance takes no square root: unlike the variance, it can be negative.
sx = sqrt((n/(n-1))*(a11/n - (a12/n)^2))
sy = sqrt((n/(n-1))*(ysqsum/n - (rhs2/n)^2))
sxy = (n/(n-1))*(rhs1/n - (a12/n)*(rhs2/n))
r = sxy/(sx*sy)
show(['r=', r])

# Plot the best-fit line y = m*u + b over the range of the data
G += plot(m*u + b, (u, min(x), max(x)), color='red')
show(G, figsize=[5,4])
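The Sage cell above solves the normal equations by matrix inversion. As an independent cross-check, the same 2×2 system can be solved by Cramer's rule in plain Python (a sketch, not part of the text; variable names are illustrative):

```python
# Same data as the Sage cell above
x = [-1.0, 3, 4, 6]
y = [1.0, 1, 3, 4]
n = len(x)

# Entries of the normal equations: [[Sxx, Sx], [Sx, n]] * [m, b] = [Sxy, Sy]
Sxx = sum(v * v for v in x)
Sx = sum(x)
Sxy = sum(xi * yi for xi, yi in zip(x, y))
Sy = sum(y)

# Solve the 2x2 system by Cramer's rule
det = Sxx * n - Sx * Sx
m = (Sxy * n - Sx * Sy) / det
b = (Sxx * Sy - Sx * Sxy) / det
print("m =", m, " b =", b)
```

The values of m and b should match the output of the Sage cell.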
Theorem 2.3.4. Interpretation of the Spearman's Rho Correlation Coefficient.
If the points are co-linear with a positive slope then r = 1, and if the points are co-linear with a negative slope then r = -1.
Proof.
Assume the data points are co-linear with a positive slope. Then
\(y_k = m x_k + b\)
for some \(m > 0\) and \(b\). For this line notice that \(y_k - \bar{y} = m(x_k - \bar{x})\) exactly for all data points. It is easy to show then that
\(s_{xy} = \frac{1}{n-1}\sum_{k=1}^{n}(x_k-\bar{x})\cdot m(x_k-\bar{x}) = m\,s_x^2\)
and
\(s_y = \sqrt{\frac{1}{n-1}\sum_{k=1}^{n} m^2(x_k-\bar{x})^2} = m\,s_x.\)
Putting these together gives correlation coefficient
\(r = \frac{s_{xy}}{s_x s_y} = \frac{m\,s_x^2}{s_x \cdot m\,s_x} = 1.\)
A similar proof follows in the second case by noting that \(m < 0\) gives \(s_y = |m|\,s_x = -m\,s_x\), and so \(r = \frac{m\,s_x^2}{-m\,s_x^2} = -1.\)
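Theorem 2.3.4 is easy to check numerically. The following plain-Python sketch (with made-up co-linear data) confirms that r comes out as 1 for a positive slope and -1 for a negative slope, up to floating-point rounding:

```python
from math import sqrt

def corr(x, y):
    """r = s_xy / (s_x * s_y); the 1/(n-1) factors cancel and are omitted."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (c - ybar) for a, c in zip(x, y))
    sx = sqrt(sum((a - xbar) ** 2 for a in x))
    sy = sqrt(sum((c - ybar) ** 2 for c in y))
    return sxy / (sx * sy)

xs = [0, 1, 2, 5, 9]
print(corr(xs, [2 * v + 3 for v in xs]))   # co-linear with positive slope
print(corr(xs, [-4 * v + 1 for v in xs]))  # co-linear with negative slope
```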
Theorem 2.3.5. Alternate formula for r.
\(r = \dfrac{\sum_{k=1}^{n} x_k y_k - n\bar{x}\bar{y}}{(n-1)\,s_x s_y}.\)
Proof.
Expand the sum in the covariance and notice that the sample means are actually constants with respect to the summation:
\(\sum_{k=1}^{n}(x_k-\bar{x})(y_k-\bar{y}) = \sum_{k=1}^{n} x_k y_k - \bar{x}\sum_{k=1}^{n} y_k - \bar{y}\sum_{k=1}^{n} x_k + n\bar{x}\bar{y} = \sum_{k=1}^{n} x_k y_k - n\bar{x}\bar{y}.\)
Dividing by \((n-1)\,s_x s_y\) gives the result.
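As a quick sanity check of the expansion, the following plain-Python sketch (with made-up random data) evaluates both the definition form and the expanded form of the covariance sum and confirms they agree:

```python
import random

# Hypothetical random paired data; the algebraic identity holds for any data set
random.seed(1)
x = [random.uniform(-5, 5) for _ in range(20)]
y = [random.uniform(-5, 5) for _ in range(20)]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

definition_form = sum((a - xbar) * (c - ybar) for a, c in zip(x, y))
alternate_form = sum(a * c for a, c in zip(x, y)) - n * xbar * ybar
print(abs(definition_form - alternate_form))  # zero up to floating-point rounding
```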
To compute the best-fit line and correlation coefficient by hand, it is often easiest to construct a table in a manner similar to how you might have computed the mean, variance, skewness and kurtosis in the previous chapter. Indeed, consider Table 2.3.6 below.

Table 2.3.6. Computing best fit line and correlation coefficient by hand

x^2 | x | y | y^2 | x*y
4 | -2 | 3 | 9 | -6
0 | 0 | 1 | 1 | 0
0 | 0 | -1 | 1 | 0
1 | 1 | 1 | 1 | 1
9 | 3 | 1 | 1 | 3
9 | 3 | -1 | 1 | -3
16 | 4 | 0 | 0 | 0
39 | 9 | 4 | 14 | -5   (column sums)

Therefore, with n = 7, \(\bar{x} = 9/7\), and \(\bar{y} = 4/7\),
\(s_{xy} = \frac{1}{6}\left(-5 - 7\cdot\frac{9}{7}\cdot\frac{4}{7}\right) = -\frac{71}{42}, \qquad s_x^2 = \frac{1}{6}\left(39 - 7\left(\frac{9}{7}\right)^2\right) = \frac{32}{7}, \qquad s_y^2 = \frac{1}{6}\left(14 - 7\left(\frac{4}{7}\right)^2\right) = \frac{41}{21},\)
and the normal equations to find the actual line are
\(39m + 9b = -5\)
\(9m + 7b = 4.\)
You should go ahead and simplify these values, solve the system of equations, plot the points and draw the line to complete the creation and evaluation of this best-fit line.
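If you would like to check your by-hand work, the following plain-Python sketch (not part of the text; it uses exact fraction arithmetic) recomputes the column sums directly from the table's x and y columns, which guards against arithmetic slips, and then solves the normal equations by Cramer's rule:

```python
from fractions import Fraction

# The (x, y) pairs read off the x and y columns of the table
pts = [(-2, 3), (0, 1), (0, -1), (1, 1), (3, 1), (3, -1), (4, 0)]
n = len(pts)
Sxx = sum(Fraction(x * x) for x, _ in pts)
Sx = sum(Fraction(x) for x, _ in pts)
Sy = sum(Fraction(y) for _, y in pts)
Sxy = sum(Fraction(x * y) for x, y in pts)
print("sums:", Sxx, Sx, Sy, Sxy)

# Normal equations: Sxx*m + Sx*b = Sxy and Sx*m + n*b = Sy; solve by Cramer's rule
det = Sxx * n - Sx * Sx
m = (Sxy * n - Sx * Sy) / det
b = (Sxx * Sy - Sx * Sxy) / det
print("m =", m, " b =", b)
```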
Checkpoint 2.3.7. WeBWorK - Calculating correlation coefficients.
Checkpoint 2.3.8. WeBWorK - Interpreting correlation coefficients.
A study found a correlation of r = -0.33 between the gender of a worker and that worker's net pay. Which of the following statements is true?
- Women earn less than men on average
- Women earn more than men on average
- Correlation canβt be negative in this instance
- r makes no sense in this context
- None of the above
Enter the letter of the correct statement:
A science project considered the possible relationship between the weight in grams and the length of the tail in a group of mice. The correlation between these two measurements is r = 0.5. If the weights had instead been measured in kilograms, what would be the resulting correlation?
- 0.5
- 0.5/1000
- 0.5(1000)
- r makes no sense in this context
- None of the above
Enter the letter of the correct statement:
Answer 1.
Answer 2.
Solution.
In the first case, r makes no sense since one can only perform linear regression on numerical paired data. As stated, the notion of gender is not a numerical quantity and thus one cannot (for example) create a scatter plot. One might however "randomly" assign a numerical value to gender (such as 1 = woman and 0 = man) but that is not part of this exercise.
For the second case, adjusting the scale of the data points will not affect the correlation coefficient. Indeed, suppose all the x's are scaled larger by a factor of 1000. Then the mean and the standard deviation of the x-values will also be 1000 times larger (convince yourself of this) and the covariance will be 1000 times larger. Indeed, for the covariance
\(s_{(1000x),y} = \frac{1}{n-1}\sum_{k=1}^{n}(1000x_k - 1000\bar{x})(y_k - \bar{y}) = 1000\,s_{xy},\)
and one can easily see that the 1000 factors out. So, 1000 will factor out of both the numerator and the denominator of \(r = s_{xy}/(s_x s_y)\) and hence cancel.
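This cancellation can also be seen numerically. The sketch below uses made-up mouse measurements (the weights and tail lengths are hypothetical) to show that the covariance scales by the factor of 1000 while r does not change:

```python
from math import sqrt

def sample_cov(x, y):
    """Sample covariance: sum((x_k - xbar)(y_k - ybar)) / (n - 1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((a - xbar) * (c - ybar) for a, c in zip(x, y)) / (n - 1)

def corr(x, y):
    """Correlation coefficient r = s_xy / (s_x * s_y)."""
    return sample_cov(x, y) / sqrt(sample_cov(x, x) * sample_cov(y, y))

# Hypothetical mouse data: weights in grams paired with tail lengths in cm
grams = [18.0, 22.5, 20.1, 25.3, 19.7]
tails = [7.1, 8.0, 7.4, 8.6, 7.2]
kilograms = [g / 1000 for g in grams]  # rescale every weight by the same factor

print(sample_cov(grams, tails) / sample_cov(kilograms, tails))  # the factor of 1000
print(corr(grams, tails), corr(kilograms, tails))               # equal correlations
```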
Once again, we can use R to determine and display the data points and best-fit line with one of the built-in data sets. To do this, the data set needs to contain paired data. Below, we use the "cars" data set since that set has speed paired with a corresponding stopping distance from some experiment. See if you can use one of the other data sets to create a best-fit line and determine if the correlation coefficient is sufficiently "good" or not.
# Load the built-in cars data set: column 1 is speed, column 2 is stopping distance
data(cars)
x <- cars[,1]
y <- cars[,2]
head <- paste("Scatter Plot and Best-Fit Line for
Speed vs Stopping Distance \n Correlation Coefficient = ", cor(x, y))
pts <- data.frame(x, y)
plot(pts, pch = 16, cex = 1.0, col = "blue", main = head, xlab = "x", ylab = "y")
lm(y ~ x)          # display the fitted slope and intercept
abline(lm(y ~ x))  # overlay the best-fit line
Repeat the above R code but instead use two of the columns from EuStockMarkets for the data. Consider why you might get a relatively high correlation coefficient with this data.
Again, repeat the above R code but instead use columns 2 and 4 from USJudgeRatings. Notice the relatively large correlation coefficient but ponder whether there is really any connection between these two columns. Repeat for columns 1 and 4 and interpret the results.