Section 2.3 Correlation

You can plot the data points and the best-fit line determined in the previous section, but the question remains whether the line is any good. In particular, the line is often used to predict y-values for given x-values. However, it is very likely that the best-fit line does not even pass through any of the provided data points. So how can something that misses every marker still be considered a good fit? To quantify this, we first need a way to measure how well two variables vary with each other.

Definition 2.3.1. Sample Covariance.

Given paired data
\begin{equation*} (x_1,y_1), (x_2,y_2), ... , (x_n,y_n) \end{equation*}
with corresponding means \(\overline{x}\) and \(\overline{y}\text{,}\) the sample covariance is given by
\begin{equation*} s_{xy} = \sum_{k=1}^n \frac{(x_k-\overline{x})(y_k-\overline{y})}{n-1}. \end{equation*}
This definition provides a second-order measure (like variance) that still carries units, namely the units of x multiplied by the units of y. To provide a unit-less metric, consider the following measure.

Definition 2.3.2. Pearson's r (Correlation Coefficient).

Given a collection of data points, the Pearson correlation coefficient is given by
\begin{equation*} r = \frac{s_{xy}}{s_x s_y} \end{equation*}
where \(s_x\) is the standard deviation of the x-values only and \(s_y\) is the standard deviation of the y-values only.
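As a quick numerical check of the two definitions, the following R sketch computes \(s_{xy}\) and \(r\) directly from the formulas and compares them with R's built-in `cov()` and `cor()`. The data vectors are made up purely for illustration and do not come from the text. The rescaling at the end illustrates that the covariance carries units while \(r\) does not.

```r
# Hypothetical paired data, chosen only for illustration.
x <- c(1, 2, 4, 5, 8)
y <- c(2, 3, 3, 6, 7)
n <- length(x)

# Sample covariance straight from Definition 2.3.1.
s_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Correlation coefficient from Definition 2.3.2; sd() uses the n - 1 divisor.
r <- s_xy / (sd(x) * sd(y))

# Both should agree with the built-in functions.
stopifnot(isTRUE(all.equal(s_xy, cov(x, y))))
stopifnot(isTRUE(all.equal(r, cor(x, y))))

# Changing units (say, x divided by 1000) rescales the covariance
# but leaves r unchanged.
x_scaled <- x / 1000
print(c(cov = cov(x_scaled, y), r = cor(x_scaled, y)))
```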
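If the data points are co-linear, then \(r = 1\) when the line has positive slope and \(r = -1\) when it has negative slope. To see this, first assume the data points are co-linear with a positive slope. Then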
\begin{equation*} TSE(m_0,b_0) = 0 \end{equation*}
for some \(m_0\) and \(b_0\text{.}\) For this line notice that \(f(x_k) = y_k\) exactly for all data points. It is easy to show then that
\begin{equation*} \overline{y} = m_0 \overline{x} + b_0 \end{equation*}
and
\begin{equation*} s_y = | m_0 | s_x\text{.} \end{equation*}
Therefore,
\begin{equation*} s_{xy} = \sum_{k=1}^n \frac{(x_k - \overline{x})(m_0 x_k + b_0 - (m_0 \overline{x} + b_0))}{n-1} = m_0 s_x^2\text{.} \end{equation*}
Putting these together gives correlation coefficient
\begin{equation*} r = \frac{m_0 s_x^2}{s_x | m_0 | s_x} = \frac{m_0}{| m_0 |} = 1. \end{equation*}
A similar proof follows in the second case, where the slope is negative, by noting that
\begin{equation*} m_0 / | m_0 | = -1. \end{equation*}
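The argument above can be checked numerically. The sketch below places points exactly on a line with slope \(m_0 = 2\) and then \(m_0 = -2\); the x-values, slopes and intercept are arbitrary illustrative choices.

```r
# Hypothetical x-values; any values that are not all equal would do.
x <- c(-2, 0, 1, 3, 4)

# Points exactly on a line with positive slope, then with negative slope.
y_up   <- 2 * x + 1
y_down <- -2 * x + 1

r_up   <- cor(x, y_up)
r_down <- cor(x, y_down)
print(c(r_up, r_down))

# s_y = |m_0| s_x, as used in the argument above.
stopifnot(isTRUE(all.equal(sd(y_up), abs(2) * sd(x))))
stopifnot(isTRUE(all.equal(sd(y_down), abs(-2) * sd(x))))
```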
Expand \((x_k-\overline{x})(y_k-\overline{y})\) and notice that the sample means are actually constants with respect to the summation.
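Carrying out that expansion explicitly, and using \(\sum x_k = n \overline{x}\) and \(\sum y_k = n \overline{y}\text{,}\) one way to write out the step is
\begin{equation*} \sum_{k=1}^n (x_k-\overline{x})(y_k-\overline{y}) = \sum_{k=1}^n x_k y_k - \overline{y} \sum_{k=1}^n x_k - \overline{x} \sum_{k=1}^n y_k + n \overline{x} \, \overline{y} = \sum_{k=1}^n x_k y_k - n \overline{x} \, \overline{y} \end{equation*}
so that
\begin{equation*} s_{xy} = \frac{n}{n-1} \left [ \frac{1}{n} \sum_{k=1}^n x_k y_k - \overline{x} \, \overline{y} \right ]\text{.} \end{equation*}
Replacing \(y_k\) by \(x_k\) throughout gives the analogous computational form for \(s_x^2\text{.}\)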
To compute the best-fit line and correlation coefficient by hand, it is often easiest to construct a table, much as you might have done to compute the mean, variance, skewness and kurtosis in the previous chapter. For example, the table below lists seven data points, one per row, with the column totals in the last row.
Table 2.3.5. Computing best fit line and correlation coefficient by hand
\(x_k^2\) \(x_k\) \(y_k\) \(y_k^2\) \(x_k \cdot y_k\)
4 -2 3 9 -6
0 0 1 1 0
0 0 -1 1 0
1 1 1 1 1
9 3 1 1 3
9 3 -1 1 -3
16 4 0 0 0
39 9 4 14 -5 (column totals)
Therefore, with \(n = 7\text{,}\)
\begin{equation*} \overline{x} = \frac{9}{7} \end{equation*}
\begin{equation*} \overline{y} = \frac{4}{7} \end{equation*}
\begin{equation*} s_x^2 = \frac{7}{6} \cdot \left [ \frac{39}{7} - \left ( \frac{9}{7} \right )^2 \right ] = \frac{32}{7} \end{equation*}
\begin{equation*} s_y^2 = \frac{7}{6} \cdot \left [ \frac{14}{7} - \left ( \frac{4}{7} \right )^2 \right ] = \frac{41}{21} \end{equation*}
\begin{equation*} s_{xy} = \frac{7}{6} \cdot \left [ \frac{-5}{7} - \frac{9}{7} \cdot \frac{4}{7} \right ] = -\frac{71}{42} \end{equation*}
\begin{equation*} r = \frac{-\frac{71}{42}}{\sqrt{\frac{32}{7}} \sqrt{\frac{41}{21}}} \end{equation*}
and the normal equations to find the actual line are
\begin{equation*} 39m + 9 b = -5 \text{ and } 9m + 7b = 4. \end{equation*}
You should go ahead and simplify these values, solve the system of equations, plot the points and draw the line to complete the creation and evaluation of this best-fit line.
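The hand computation can be cross-checked in R. The sketch below rebuilds the column totals from the \(x_k\) and \(y_k\) columns of the table, applies the same computational formulas, solves the normal equations, and compares the results with the built-in `cor()` and `lm()`. The variable names are illustrative choices.

```r
# Data points read off the x_k and y_k columns of the table above.
x <- c(-2, 0, 0, 1, 3, 3, 4)
y <- c(3, 1, -1, 1, 1, -1, 0)
n <- length(x)

# Column totals, as in the last row of the table.
sums <- c(x2 = sum(x^2), x = sum(x), y = sum(y), y2 = sum(y^2), xy = sum(x * y))
print(sums)

xbar <- sums[["x"]] / n
ybar <- sums[["y"]] / n

# Computational forms of the variances and covariance.
s2x <- n / (n - 1) * (sums[["x2"]] / n - xbar^2)
s2y <- n / (n - 1) * (sums[["y2"]] / n - ybar^2)
sxy <- n / (n - 1) * (sums[["xy"]] / n - xbar * ybar)
r   <- sxy / sqrt(s2x * s2y)

# Normal equations:
#   sum(x^2) m + sum(x) b = sum(x y)
#   sum(x)   m + n      b = sum(y)
A   <- matrix(c(sums[["x2"]], sums[["x"]],
                sums[["x"]],  n), nrow = 2, byrow = TRUE)
rhs <- c(sums[["xy"]], sums[["y"]])
mb  <- solve(A, rhs)   # c(slope m, intercept b)
print(c(m = mb[1], b = mb[2], r = r))

# Agreement with the built-in routines; coef() returns (intercept, slope).
stopifnot(isTRUE(all.equal(r, cor(x, y))))
stopifnot(isTRUE(all.equal(mb, rev(unname(coef(lm(y ~ x)))))))
```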
Consider the following small data set.
Subject x y
\(1\) \(6\) \(19\)
\(2\) \(8\) \(17\)
\(3\) \(14\) \(21\)
\(4\) \(8\) \(31\)
\(5\) \(6\) \(17\)
Find the linear correlation coefficient.
\(r =\)
Answer.
\(0.156556072771287\)
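The stated answer can be reproduced with a single call to R's built-in `cor()`, entering the x and y columns of the data set above as vectors:

```r
# x and y columns for subjects 1 through 5.
x <- c(6, 8, 14, 8, 6)
y <- c(19, 17, 21, 31, 17)

r <- cor(x, y)
print(r, digits = 15)
```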
A study found a correlation of r = -0.33 between the gender of a worker and that worker’s net pay. From the following statements, which is true?
  a. Women earn less than men on average
  b. Women earn more than men on average
  c. Correlation can’t be negative in this instance
  d. r makes no sense in this context
  e. None of the above
Enter the letter of the correct statement:
A science project considered the possible relationship between the weight in grams and the length of the tail in a group of mice. The correlation between these two measurements is r = 0.5. If the weights had been measured in kilograms instead of grams, what would be the resulting correlation?
  a. 0.5
  b. 0.5/1000
  c. 0.5(1000)
  d. r makes no sense in this context
  e. None of the above
Enter the letter of the correct statement:
Answer 1.
\(d\)
Answer 2.
\(a\)
Once again, we can use R to compute and display the data points and best-fit line with one of the built-in data sets. To do this, the data set needs to contain paired data. Below, we use the 'cars' data set since it pairs speed with a corresponding stopping distance from an experiment. See if you can use one of the other data sets to create a best-fit line and determine whether the correlation coefficient is sufficiently 'good' or not.
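A minimal sketch of such a computation with the 'cars' data set follows. It assumes only base R. The column names `speed` and `dist` are those of the built-in data set, while the variable names and plot styling are illustrative choices rather than the text's own code.

```r
# Built-in data set: speed and stopping distance for 50 cars.
data(cars)
x <- cars$speed
y <- cars$dist

fit <- lm(y ~ x)   # least-squares best-fit line
r   <- cor(x, y)   # correlation coefficient

# Scatter plot of the data with the fitted line drawn on top.
plot(x, y, xlab = "speed", ylab = "dist",
     main = "cars: stopping distance vs. speed")
abline(fit, col = "red")

print(coef(fit))   # intercept and slope
print(r)
```

To try another data set, replace the two lines that define `x` and `y` with any pair of numeric columns of equal length.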
Repeat the above R code but instead use two of the columns from EuStockMarkets for the data. Consider why you might get a relatively high correlation coefficient with this data.
Again, repeat the above R code but instead use columns 2 and 4 from USJudgeRatings. Notice the relatively large correlation coefficient, but ponder whether there is really any connection between these two columns. Repeat for columns 1 and 4 and interpret the results.
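One possible way to organize these repetitions is to wrap the fit, plot and correlation in a small function. The helper name `fit_and_correlate` is hypothetical, and the choice of the DAX and FTSE columns from EuStockMarkets is an assumption made for illustration, since the exercise leaves the two columns open.

```r
# Hypothetical helper (not from the text): fit a line, plot it, and report r.
fit_and_correlate <- function(x, y, label = "") {
  x <- as.numeric(x)   # drop time-series attributes, if any
  y <- as.numeric(y)
  fit <- lm(y ~ x)
  plot(x, y, main = label)
  abline(fit, col = "red")
  list(coefficients = coef(fit), r = cor(x, y))
}

# Two columns of EuStockMarkets (assumed choice: DAX and FTSE).
eu <- fit_and_correlate(EuStockMarkets[, "DAX"], EuStockMarkets[, "FTSE"],
                        "EuStockMarkets: DAX vs. FTSE")

# Columns 2 and 4, then columns 1 and 4, of USJudgeRatings.
j24 <- fit_and_correlate(USJudgeRatings[, 2], USJudgeRatings[, 4],
                         "USJudgeRatings: columns 2 and 4")
j14 <- fit_and_correlate(USJudgeRatings[, 1], USJudgeRatings[, 4],
                         "USJudgeRatings: columns 1 and 4")

print(c(eu = eu$r, j24 = j24$r, j14 = j14$r))
```

When interpreting the output, keep in mind that a large value of r by itself does not establish a meaningful connection between two columns, which is the point the exercises ask you to consider.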