
Section 2.3 Correlation

You can plot data points together with the best-fit line determined in the previous section, but the question remains whether the line is any good. In particular, the real use of the line is often to predict y-values for a given x-value. However, it is very likely that the best-fit line does not pass through even one of the provided data points. So, how can something that misses every marker still be considered a good fit? To quantify this, we first need to discuss a way to measure how well two variables vary together.
Determine the best-fit line for the following data:
\begin{equation*} (0,1),(1,1),(1,0),(0,0) \end{equation*}
and then again for
\begin{equation*} (0,1),(1,1),(1,2),(2,2). \end{equation*}
Plot both data sets and plot their corresponding best-fit lines.
  • Comment on how well each line seems to describe the data sets.
  • Might you be more likely to use the best-fit line to predict an outcome when the first coordinate is 1/2? Why or why not?
  • Are you willing to say one line is "better" than the other?

Definition 2.3.2. Sample Covariance.

Given paired data
\begin{equation*} (x_1,y_1), (x_2,y_2), ... , (x_n,y_n) \end{equation*}
with corresponding means \(\overline{x}\) and \(\overline{y}\text{,}\) the sample covariance is given by
\begin{equation*} s_{xy} = \sum_{k=1}^n \frac{(x_k-\overline{x})(y_k-\overline{y})}{n-1}. \end{equation*}
This definition provides a second-order measure (like the variance) of how two variables vary together, but it still carries units: those of x times those of y. To provide a unit-less metric, consider the following measure.

Definition 2.3.3. Pearson’s Correlation Coefficient.

Given a collection of data points, the Pearson correlation coefficient (often simply the correlation coefficient) is given by
\begin{equation*} r = \frac{s_{xy}}{s_x s_y} \end{equation*}
where \(s_x\) is the standard deviation of the x-values only and \(s_y\) is the standard deviation of the y-values only.
Assume the data points are collinear with a positive slope. Then
\begin{equation*} TSE(m_0,b_0) = 0 \end{equation*}
for some \(m_0\) and \(b_0\text{.}\) For this line, notice that \(f(x_k) = m_0 x_k + b_0 = y_k\) exactly for all data points. It is then easy to show that
\begin{equation*} \overline{y} = m_0 \overline{x} + b_0 \end{equation*}
and
\begin{equation*} s_y = | m_0 | s_x\text{.} \end{equation*}
Therefore,
\begin{equation*} s_{xy} = \sum_{k=1}^n \frac{(x_k - \overline{x})(m_0 x_k + b_0 - (m_0 \overline{x} + b_0))}{n-1} = m_0 \sum_{k=1}^n \frac{(x_k - \overline{x})^2}{n-1} = m_0 s_x^2. \end{equation*}
Putting these together, and noting that \(s_y = |m_0| s_x = m_0 s_x\) since \(m_0 \gt 0\text{,}\) gives correlation coefficient
\begin{equation*} r = \frac{m_0 s_x^2}{s_x m_0 s_x} = 1. \end{equation*}
A similar argument handles collinear data with negative slope, by noting that in that case
\begin{equation*} \frac{m_0}{|m_0|} = -1 \end{equation*}
and hence \(r = -1\text{.}\)
Expand \((x_k-\overline{x})(y_k-\overline{y})\) and notice that the sample means are actually constants with respect to the summation. Doing so yields the computational shortcut
\begin{equation*} s_{xy} = \frac{1}{n-1} \left ( \sum_{k=1}^n x_k y_k - n \overline{x}\,\overline{y} \right ) = \frac{n}{n-1} \left ( \frac{\sum_{k=1}^n x_k y_k}{n} - \overline{x}\,\overline{y} \right ). \end{equation*}
To compute the best-fit line and correlation coefficient by hand, it is often easiest to construct a table in a manner similar to how you might have computed the mean, variance, skewness and kurtosis in the previous chapter. Indeed,
Table 2.3.6. Computing the best-fit line and correlation coefficient by hand
\(x_k^2\) \(x_k\) \(y_k\) \(y_k^2\) \(x_k \cdot y_k\)
4 -2 3 9 -6
0 0 1 1 0
0 0 -1 1 0
1 1 1 1 1
9 3 1 1 3
9 3 -1 1 -3
16 4 0 0 0
Sum: 39 9 4 14 -5
Therefore
\begin{equation*} \overline{x} = \frac{9}{7} \end{equation*}
\begin{equation*} \overline{y} = \frac{4}{7} \end{equation*}
\begin{equation*} s_x^2 = \frac{7}{6} \cdot \left [ \frac{39}{7} - \left ( \frac{9}{7} \right )^2 \right ] = \frac{32}{7} \end{equation*}
\begin{equation*} s_y^2 = \frac{7}{6} \cdot \left [ \frac{14}{7} - \left ( \frac{4}{7} \right )^2 \right ] = \frac{41}{21} \end{equation*}
\begin{equation*} s_{xy} = \frac{7}{6} \cdot \left [ \frac{-5}{7} - \frac{9}{7} \cdot \frac{4}{7} \right ] = -\frac{71}{42} \end{equation*}
\begin{equation*} r = \frac{-\frac{71}{42}}{\sqrt{\frac{32}{7}} \sqrt{\frac{41}{21}}} \approx -0.566 \end{equation*}
and the normal equations to find the actual line are
\begin{equation*} 39m + 9 b = -5 \text{ and } 9m + 7b = 4. \end{equation*}
You should go ahead and simplify these values, solve the system of equations, plot the points and draw the line to complete the creation and evaluation of this best-fit line.
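A good way to check table arithmetic like this is to redo it with exact rational numbers. A sketch in Python using the fractions module, together with the standard identity that the least-squares slope is \(m = s_{xy}/s_x^2\) and the line passes through \((\overline{x}, \overline{y})\) (equivalent to solving the normal equations):

```python
from fractions import Fraction as F

pts = [(-2, 3), (0, 1), (0, -1), (1, 1), (3, 1), (3, -1), (4, 0)]
n = len(pts)
xs = [F(x) for x, _ in pts]
ys = [F(y) for _, y in pts]

xbar, ybar = sum(xs) / n, sum(ys) / n
s_x2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
s_y2 = sum((y - ybar) ** 2 for y in ys) / (n - 1)
s_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)

m = s_xy / s_x2        # least-squares slope, m = s_xy / s_x^2
b = ybar - m * xbar    # intercept: the line passes through (xbar, ybar)

print(xbar, ybar)            # 9/7 4/7
print(s_x2, s_y2, s_xy)      # 32/7 41/21 -71/42
print(m, b)                  # exact slope and intercept
r = float(s_xy) / (float(s_x2) ** 0.5 * float(s_y2) ** 0.5)
print(r)                     # ≈ -0.566
```

Exact arithmetic makes it immediately obvious when a hand-computed column total is off by even one unit.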
Consider the following small data set.
Subject x y
\(1\) \(6\) \(19\)
\(2\) \(8\) \(17\)
\(3\) \(14\) \(21\)
\(4\) \(8\) \(31\)
\(5\) \(6\) \(17\)
Find the linear correlation coefficient.
\(r =\)
Answer.
\(0.156556072771287\)
A study found a correlation of r = -0.33 between the gender of a worker and that worker’s net pay. Which of the following statements is true?
  a. Women earn less than men on average
  b. Women earn more than men on average
  c. Correlation can’t be negative in this instance
  d. r makes no sense in this context
  e. None of the above
Enter the letter of the correct statement:
A science project considered the possible relationship between the weight in grams and the length of the tail in a group of mice. The correlation between these two measurements is r = 0.5. If the weights had been measured in kilograms instead of grams, what would be the resulting correlation?
  a. 0.5
  b. 0.5/1000
  c. 0.5(1000)
  d. r makes no sense in this context
  e. None of the above
Enter the letter of the correct statement:
Answer 1.
\(d\)
Answer 2.
\(a\)
Solution.
In the first case, r makes no sense, since one can only perform linear regression on numerically-valued paired data. As stated, gender is not a numerical quantity, so one cannot (for example) even create a scatter plot. One might, however, arbitrarily assign a numerical value to gender (such as 1 = woman and 0 = man), but that is not part of this exercise.
For the second case, rescaling the data points does not affect the correlation coefficient. Indeed, suppose all of the x’s are scaled up by a factor of 1000. Then the mean and the standard deviation of the x-values are also 1000 times larger (convince yourself of this), and so is the covariance. Indeed, for the covariance
\begin{equation*} s_{xy} = \frac{n}{n-1} \left ( \sum \frac{1000x_k \cdot y_k}{n} - (1000 \overline{x})(\overline{y}) \right ) \end{equation*}
and one can easily see that the 1000 factors out. Thus the factor of 1000 appears in both the numerator \(s_{xy}\) and the denominator \(s_x s_y\) of r and hence cancels, leaving r unchanged.
Once again, we can use R to determine and display the data points and best-fit line for one of the built-in data sets. To do this, the data set needs to contain paired numerical data. Below, we use the ’cars’ data set, since it pairs speeds with corresponding stopping distances from an experiment. See if you can use one of the other data sets to create a best-fit line and determine whether the correlation coefficient is sufficiently ’good’ or not.
Repeat the above R code but instead use two of the columns from EuStockMarkets for the data. Consider why you might get a relatively high correlation coefficient with this data.
Again, repeat the above R code, this time using columns 2 and 4 from USJudgeRatings. Notice the relatively large correlation coefficient, but ponder whether there is really any connection between these two columns. Repeat for columns 1 and 4 and interpret the results.