Section 2.3 Correlation
You can plot the data points together with the best-fit line determined in the previous section, but the question remains whether the line is any good. In particular, the line is often used to predict a y-value for a given x-value. However, it is very likely that the best-fit line does not pass through any of the provided data points. So, how can something that misses every marker still be considered a good fit? To quantify this, we first need a way to measure how strongly two variables vary together.
Definition 2.3.1. Sample Covariance.
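Given paired data
\[(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\]
with corresponding means \(\overline{x}\) and \(\overline{y}\), the sample covariance is given by
\[s_{xy} = \frac{\sum_{k=1}^{n} (x_k - \overline{x})(y_k - \overline{y})}{n-1}.\]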
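As a minimal sketch, the sample covariance (the sum of the products of deviations from the means, divided by n - 1) can be computed by hand in R and compared with the built-in cov() function. The data values and the variable names x, y, n, and s_xy below are hypothetical, chosen only for illustration.

```r
# Hypothetical paired data, made up purely for illustration
x <- c(1, 2, 4, 7)
y <- c(3, 5, 4, 10)
n <- length(x)

# Sample covariance from the definition:
# sum of products of deviations from the means, divided by n - 1
s_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Base R's cov() also divides by n - 1, so the two values should agree
print(s_xy)
print(cov(x, y))
```

A positive value indicates that above-average x-values tend to be paired with above-average y-values. The size of the covariance depends on the units of both variables, so it is hard to interpret on its own.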
Definition 2.3.2. Pearson's Correlation Coefficient.
Given a collection of data points, the Pearson correlation coefficient is given by
\[r = \frac{s_{xy}}{s_x s_y}\]
where \(s_x\) is the standard deviation of the x-values only and \(s_y\) is the standard deviation of the y-values only.
Theorem 2.3.3. Interpretation of the Pearson Correlation Coefficient.
If the points are co-linear with a positive slope, then \(r = 1\), and if the points are co-linear with a negative slope, then \(r = -1\).
Proof.
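Assume the data points are co-linear with a positive slope. Then every data point lies on a line
\[f(x) = m_0 x + b_0\]
for some \(m_0 > 0\) and \(b_0\text{.}\) For this line notice that \(f(x_k) = y_k\) exactly for all data points. It is easy to show then that
\[\overline{y} = m_0 \overline{x} + b_0\]
and
\[s_y = m_0 s_x.\]
Therefore,
\[s_{xy} = \frac{\sum (x_k - \overline{x})(m_0 x_k + b_0 - m_0 \overline{x} - b_0)}{n-1} = m_0 \frac{\sum (x_k - \overline{x})^2}{n-1} = m_0 s_x^2.\]
Putting these together gives correlation coefficient
\[r = \frac{s_{xy}}{s_x s_y} = \frac{m_0 s_x^2}{s_x \cdot m_0 s_x} = 1.\]
A similar proof follows in the second case by noting that
\[s_y = |m_0| s_x = -m_0 s_x \quad \text{when } m_0 < 0,\]
which gives \(r = -1\).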
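The theorem can be checked numerically. The sketch below assumes hypothetical x-values and two hypothetical lines with slopes 2.5 and -2.5. The helper corr_from_definition is an illustrative name, not a built-in R function; it computes the covariance divided by the product of the two standard deviations.

```r
# Hypothetical x-values and two hypothetical lines (slopes 2.5 and -2.5)
x <- c(-2, 0, 1, 3, 4)
y_up <- 2.5 * x + 1
y_down <- -2.5 * x + 1

# Correlation coefficient as covariance over the product of standard deviations
corr_from_definition <- function(x, y) {
  cov(x, y) / (sd(x) * sd(y))
}

# Points lying exactly on a line should give r = 1 or r = -1 (up to rounding)
r_up <- corr_from_definition(x, y_up)
r_down <- corr_from_definition(x, y_down)
print(c(r_up, r_down))
```

Changing the slope or intercept to any other nonzero slope should leave the two results at 1 and -1, since only the sign of the slope matters.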
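Theorem 2.3.4. Alternate formula for r.
\[r = \frac{n\sum x_k y_k - \sum x_k \sum y_k}{\sqrt{n\sum x_k^2 - \left(\sum x_k\right)^2}\,\sqrt{n\sum y_k^2 - \left(\sum y_k\right)^2}}\]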
Proof.
Expand \((x_k-\overline{x})(y_k-\overline{y})\) and notice that the sample means are actually constants with respect to the summation.
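One way to carry out this expansion is sketched below. It assumes all sums run over k = 1, ..., n and uses the fact that the sum of the x-values equals n times their mean (and likewise for y).

```latex
\[
\sum (x_k-\overline{x})(y_k-\overline{y})
  = \sum x_k y_k - \overline{y}\sum x_k - \overline{x}\sum y_k + n\,\overline{x}\,\overline{y}
  = \sum x_k y_k - n\,\overline{x}\,\overline{y}
  = \frac{n\sum x_k y_k - \sum x_k \sum y_k}{n}
\]
\[
\sum (x_k-\overline{x})^2 = \frac{n\sum x_k^2 - \left(\sum x_k\right)^2}{n},
\qquad
\sum (y_k-\overline{y})^2 = \frac{n\sum y_k^2 - \left(\sum y_k\right)^2}{n}
\]
```

The second line is the first identity with both factors taken from the same variable. When the covariance is divided by the product of the two standard deviations, the factors of n - 1 cancel, and so do the factors of n, leaving an expression in the raw sums only.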
\(x_k^2\) | \(x_k\) | \(y_k\) | \(y_k^2\) | \(x_k \cdot y_k\) |
4 | -2 | 3 | 9 | -6 |
0 | 0 | 1 | 1 | 0 |
0 | 0 | -1 | 1 | 0 |
1 | 1 | 1 | 1 | 1 |
9 | 3 | 1 | 1 | 3 |
9 | 3 | -1 | 1 | -3 |
16 | 4 | 0 | 0 | 0 |
Column sums:
39 | 9 | 4 | 14 | -5 |
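The x and y columns of the table above can be used to evaluate r from raw sums. The following is a sketch only: it assumes the seven rows above the sums are the data points, and every variable name is an illustrative choice. It also cross-checks the result against the covariance divided by the product of the standard deviations.

```r
# x and y columns copied from the table above (seven data points)
x <- c(-2, 0, 0, 1, 3, 3, 4)
y <- c(3, 1, -1, 1, 1, -1, 0)
n <- length(x)

# Sums needed by the formula that uses raw sums only
sum_x <- sum(x)
sum_y <- sum(y)
sum_x2 <- sum(x^2)
sum_y2 <- sum(y^2)
sum_xy <- sum(x * y)

# Correlation coefficient computed from the raw sums
r_alt <- (n * sum_xy - sum_x * sum_y) /
  (sqrt(n * sum_x2 - sum_x^2) * sqrt(n * sum_y2 - sum_y^2))

# Cross-check: covariance divided by the product of standard deviations
r_def <- cov(x, y) / (sd(x) * sd(y))
print(c(r_alt, r_def))
```

Both values should come out near -0.57, which suggests a moderate negative linear association for this small data set.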
Checkpoint 2.3.6. WeBWorK - Interpreting correlation coefficients.
For each problem, select the best response.
(a) For a biology project, you measure the weight in grams and the tail length in millimeters of a group of mice. The correlation is r = 0.4. If you had measured tail length in centimeters instead of millimeters, what would be the correlation? (There are 10 millimeters in a centimeter.)
(0.4)(10) = 4
0.4
0.4/10 = 0.04
None of the above.
(b) A study found a correlation of r = -0.61 between the gender of a worker and his or her income. You may correctly conclude
women earn less than men on average.
an arithmetic mistake was made. Correlation must be positive.
this is incorrect because r makes no sense here.
women earn more than men on average.
None of the above.
Checkpoint 2.3.7. WeBWorK - Interpreting correlation coefficients again.
For each problem, select the best response.
(a) In a scatterplot of the average price of a barrel of oil and the average retail price of a gallon of gasoline, you expect to see
a positive association.
very little association.
a negative association.
None of the above.
(b) If the correlation between two variables is close to 0, you can conclude that a scatterplot would show
no straight-line pattern, but there might be a strong pattern of another form.
a cloud of points with no visible pattern.
a strong straight-line pattern.
None of the above.
(c) What are all the values that a correlation r can possibly take?
-1 ≤ r ≤ 1
0 ≤ r ≤ 1
r ≥ 0
None of the above.
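# Load the built-in cars data set: speed (mph) and stopping distance (ft)
data(cars)
x <- cars$speed
y <- cars$dist

# Fit the least-squares line once and reuse it for printing and plotting
fit <- lm(y ~ x)
print(fit)

# Put the rounded correlation coefficient in the plot title
plot_title <- paste0("Scatter Plot and Best-Fit Line for\n",
                     "Speed vs Stopping Distance\n",
                     "Correlation Coefficient = ", round(cor(x, y), 4))

plot(x, y, pch = 16, cex = 1.0, col = "blue", main = plot_title,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")
abline(fit)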