Correlation

Section 2.3 Correlation

You can plot the data points together with the best-fit line determined in the previous section, but the question remains whether the line is any good. In particular, the line is often used to predict a y-value for a given x-value. However, the best-fit line quite possibly does not pass through any of the provided data points. So how can something that misses every marker still be considered a good fit? To quantify this, we first need a way to measure how strongly two variables vary together.

Definition 2.3.1. Sample Covariance.

Given paired data

\begin{equation*} (x_1,y_1), (x_2,y_2), ... , (x_n,y_n) \end{equation*}

with corresponding means \(\overline{x}\) and \(\overline{y}\text{,}\) the sample covariance is given by

\begin{equation*} s_{xy} = \sum_{k=1}^n \frac{(x_k-\overline{x})(y_k-\overline{y})}{n-1}. \end{equation*}
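As a quick illustration, the definition can be evaluated directly in R. This is only a sketch: the vectors `x` and `y` are the seven data points from Table 2.3.5 later in this section, and the variable names are chosen here for illustration.

```r
# Paired data (the seven points from the worked table in this section).
x <- c(-2, 0, 0, 1, 3, 3, 4)
y <- c(3, 1, -1, 1, 1, -1, 0)
n <- length(x)

# Sample covariance straight from the definition.
s_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# R's built-in cov() uses the same n - 1 divisor, so the two should agree.
print(s_xy)
print(cov(x, y))
```

Computing the value both ways is a simple check that the hand formula has been typed correctly.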

The covariance is a second-order quantity (like variance), but it still carries units: the units of x multiplied by the units of y. To obtain a unitless metric, consider the following measure.

Definition 2.3.2. Pearson's Correlation Coefficient.

Given a collection of paired data points, Pearson's correlation coefficient is given by

\begin{equation*} r = \frac{s_{xy}}{s_x s_y} \end{equation*}

where \(s_x\) is the standard deviation of the x-values only and \(s_y\) is the standard deviation of the y-values only.
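The ratio above can be computed in R as a minimal sketch, again using the seven data points from Table 2.3.5. The rescaled vector `10 * x` is a hypothetical change of units, included only to show that r is unitless.

```r
x <- c(-2, 0, 0, 1, 3, 3, 4)
y <- c(3, 1, -1, 1, 1, -1, 0)

# Correlation coefficient from the definition:
# covariance divided by the product of the two standard deviations.
r <- cov(x, y) / (sd(x) * sd(y))

# Rescaling x (a change of units) changes the covariance but not r.
r_rescaled <- cov(10 * x, y) / (sd(10 * x) * sd(y))

print(r)
print(r_rescaled)
print(cor(x, y))  # cor() with its default method computes this same ratio
```

All three printed values should agree up to rounding.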

Theorem 2.3.3. Interpretation of Pearson's Correlation Coefficient.

If the points are collinear with a positive slope then \(r = 1\text{,}\) and if the points are collinear with a negative slope then \(r = -1\text{.}\)

Proof.

Assume the data points are collinear with a positive slope. Then

\begin{equation*} TSE(m_0,b_0) = 0 \end{equation*}

for some \(m_0\) and \(b_0\text{.}\) For this line, \(f(x) = m_0 x + b_0\text{,}\) notice that \(f(x_k) = y_k\) exactly for every data point. It is easy to show then that

\begin{equation*} \overline{y} = m_0 \overline{x} + b_0 \end{equation*}

and

\begin{equation*} s_y = | m_0 | s_x\text{.} \end{equation*}
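Both identities come from substituting \(y_k = m_0 x_k + b_0\) into the definitions of the mean and the standard deviation. A short sketch, assuming the \(n-1\) form of the sample variance used elsewhere in this section:

\begin{equation*} \overline{y} = \frac{1}{n} \sum_{k=1}^n (m_0 x_k + b_0) = m_0 \overline{x} + b_0 \end{equation*}

\begin{equation*} s_y^2 = \sum_{k=1}^n \frac{(m_0 x_k + b_0 - (m_0 \overline{x} + b_0))^2}{n-1} = m_0^2 s_x^2, \quad \text{so} \quad s_y = | m_0 | s_x\text{.} \end{equation*}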

Therefore,

\begin{equation*} s_{xy} = \sum_{k=1}^n \frac{(x_k - \overline{x})(m_0 x_k + b_0 - (m_0 \overline{x} + b_0))}{n-1} = m_0 s_x^2 \end{equation*}

Putting these together, and noting that \(| m_0 | = m_0\) when the slope is positive, gives the correlation coefficient

\begin{equation*} r = \frac{m_0 s_x^2}{s_x | m_0 | s_x} = \frac{m_0}{| m_0 |} = 1. \end{equation*}

A similar proof follows in the second case by noting that, for a negative slope,

\begin{equation*} m_0 / | m_0 | = -1. \end{equation*}
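The theorem can be checked numerically. The sketch below uses made-up collinear points. The slopes and intercepts are hypothetical values chosen only for illustration, and the comparison uses a tolerance because floating-point arithmetic may not return exactly 1 or -1.

```r
x <- c(-2, 0, 1, 3, 4)

y_up   <- 2 * x + 1      # collinear, positive slope (m_0 = 2, b_0 = 1)
y_down <- -0.5 * x + 3   # collinear, negative slope (m_0 = -0.5, b_0 = 3)

r_up   <- cor(x, y_up)
r_down <- cor(x, y_down)

# Compare with a tolerance rather than with ==.
print(isTRUE(all.equal(r_up, 1)))
print(isTRUE(all.equal(r_down, -1)))

# The intermediate identity s_y = |m_0| s_x from the proof.
print(isTRUE(all.equal(sd(y_down), abs(-0.5) * sd(x))))
```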

Theorem 2.3.4. Alternate formula for the sample covariance.

\begin{equation*} s_{xy} = \frac{n}{n-1} \left [ \frac{\sum_{k=1}^n x_k y_k}{n} - \overline{x} \overline{y} \right ]. \end{equation*}

Proof.

Expand \((x_k-\overline{x})(y_k-\overline{y})\) and notice that the sample means are actually constants with respect to the summation.
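Written out, the expansion suggested in the proof runs as follows, using \(\sum_{k=1}^n x_k = n \overline{x}\) and \(\sum_{k=1}^n y_k = n \overline{y}\text{:}\)

\begin{equation*} \sum_{k=1}^n (x_k-\overline{x})(y_k-\overline{y}) = \sum_{k=1}^n x_k y_k - \overline{y} \sum_{k=1}^n x_k - \overline{x} \sum_{k=1}^n y_k + n \overline{x} \overline{y} = \sum_{k=1}^n x_k y_k - n \overline{x} \overline{y}. \end{equation*}

Dividing by \(n-1\) and factoring out \(n\) gives

\begin{equation*} s_{xy} = \frac{1}{n-1} \left [ \sum_{k=1}^n x_k y_k - n \overline{x} \overline{y} \right ] = \frac{n}{n-1} \left [ \frac{\sum_{k=1}^n x_k y_k}{n} - \overline{x} \overline{y} \right ]. \end{equation*}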

To compute the best-fit line and correlation coefficient by hand, it is often easiest to construct a table in a manner similar to how you might have computed the mean, variance, skewness and kurtosis in the previous chapter. For example, consider the following seven data points.

Table 2.3.5. Computing the best-fit line and correlation coefficient by hand

\(x_k^2\)	\(x_k\)	\(y_k\)	\(y_k^2\)	\(x_k \cdot y_k\)
4	-2	3	9	-6
0	0	1	1	0
0	0	-1	1	0
1	1	1	1	1
9	3	1	1	3
9	3	-1	1	-3
16	4	0	0	0
39	9	4	14	-5

The last row gives the column totals for the \(n = 7\) data points. Therefore

\begin{equation*} \overline{x} = \frac{9}{7} \end{equation*}

\begin{equation*} \overline{y} = \frac{4}{7} \end{equation*}

\begin{equation*} s_x^2 = \frac{7}{6} \cdot \left [ \frac{39}{7} - \left ( \frac{9}{7} \right )^2 \right ] = \frac{32}{7} \end{equation*}

\begin{equation*} s_y^2 = \frac{7}{6} \cdot \left [ \frac{14}{7} - \left ( \frac{4}{7} \right )^2 \right ] = \frac{41}{21} \end{equation*}

\begin{equation*} s_{xy} = \frac{7}{6} \cdot \left [ \frac{-5}{7} - \frac{9}{7} \cdot \frac{4}{7} \right ] = -\frac{71}{42} \end{equation*}

\begin{equation*} r = \frac{-\frac{71}{42}}{\sqrt{\frac{32}{7}} \sqrt{\frac{41}{21}}} \end{equation*}

and the normal equations to find the actual line are

\begin{equation*} 39m + 9 b = -5 \text{ and } 9m + 7b = 4. \end{equation*}

You should go ahead and simplify these values, solve the system of equations, plot the points and draw the line to complete the creation and evaluation of this best-fit line.
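One way to check your hand work is to let R redo the arithmetic from the raw data. The sketch below is not the only way to do this. It rebuilds the column totals, applies the shortcut formulas, and solves the two normal equations with `solve()`. The variable names are chosen here for illustration.

```r
x <- c(-2, 0, 0, 1, 3, 3, 4)
y <- c(3, 1, -1, 1, 1, -1, 0)
n <- length(x)

# Column totals, in the same order as the columns of the table.
totals <- c(x2 = sum(x^2), x = sum(x), y = sum(y), y2 = sum(y^2), xy = sum(x * y))
print(totals)

# Shortcut formulas built from the totals.
x_bar <- totals[["x"]] / n
y_bar <- totals[["y"]] / n
s2_x  <- n / (n - 1) * (totals[["x2"]] / n - x_bar^2)
s2_y  <- n / (n - 1) * (totals[["y2"]] / n - y_bar^2)
s_xy  <- n / (n - 1) * (totals[["xy"]] / n - x_bar * y_bar)
r     <- s_xy / sqrt(s2_x * s2_y)

# Normal equations:
#   sum(x^2) m + sum(x) b = sum(x y)
#   sum(x)   m + n      b = sum(y)
A   <- matrix(c(totals[["x2"]], totals[["x"]],
                totals[["x"]],  n), nrow = 2, byrow = TRUE)
rhs <- c(totals[["xy"]], totals[["y"]])
mb  <- solve(A, rhs)  # mb[1] is the slope m, mb[2] is the intercept b

print(c(r = r, m = mb[1], b = mb[2]))

# Scatterplot of the points with the fitted line drawn through them.
plot(x, y, pch = 19)
abline(a = mb[2], b = mb[1])
```

The slope and intercept found this way should match what `lm(y ~ x)` reports, which gives a second independent check.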

Checkpoint 2.3.6. WeBWorK - Interpreting correlation coefficients.

For each problem, select the best response.

(a) For a biology project, you measure the weight in grams and the tail length in millimeters of a group of mice. The correlation is r = 0.4. If you had measured tail length in centimeters instead of millimeters, what would be the correlation? (There are 10 millimeters in a centimeter.)

(0.4)(10) = 4
0.4
0.4/10 = 0.04
None of the above.

(b) A study found a correlation of r = -0.61 between the gender of a worker and his or her income. You may correctly conclude

women earn less than men on average.
an arithmetic mistake was made. Correlation must be positive.
this is incorrect because r makes no sense here.
women earn more than men on average.
None of the above.

Checkpoint 2.3.7. WeBWorK - Interpreting correlation coefficients again.

For each problem, select the best response.

(a) In a scatterplot of the average price of a barrel of oil and the average retail price of a gallon of gasoline, you expect to see

a positive association.
very little association.
a negative association.
None of the above.

(b) If the correlation between two variables is close to 0, you can conclude that a scatterplot would show

no straight-line pattern, but there might be a strong pattern of another form.
a cloud of points with no visible pattern.
a strong straight-line pattern.
None of the above.

(c) Which of the following describes all the values that the correlation coefficient r can possibly take?

\(-1 \le r \le 1\)
\(0 \le r \le 1\)
\(r \ge 0\)
None of the above.

Once again, we can use R to compute and display the data points and best-fit line using one of the built-in data sets. To do this, the data set must contain paired data. Below, we use the 'cars' data set, since it pairs each speed with a corresponding stopping distance from an experiment. See if you can use one of the other data sets to create a best-fit line and determine whether the correlation coefficient is sufficiently 'good' or not.
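A minimal sketch of such a computation follows. It assumes that base R's `lm()` and `cor()` are acceptable stand-ins for the hand formulas above. The axis labels and variable names are chosen here for illustration.

```r
# 'cars' ships with base R: speed paired with stopping distance.
data(cars)
x <- cars$speed
y <- cars$dist

# Least-squares line y = m x + b via lm().
fit <- lm(y ~ x)
b <- coef(fit)[[1]]  # intercept
m <- coef(fit)[[2]]  # slope

# Correlation coefficient for the same paired data.
r <- cor(x, y)
print(c(m = m, b = b, r = r))

# Scatterplot with the best-fit line drawn on top.
plot(x, y, xlab = "speed", ylab = "stopping distance", pch = 19)
abline(fit)
```

To try another data set, only the two lines that assign `x` and `y` need to change.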

Repeat the above R code but instead use two of the columns from EuStockMarkets for the data. Consider why you might get a relatively high correlation coefficient with this data.
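As a starting point, here is one possible sketch that pairs the DAX and FTSE columns. The choice of these two columns is arbitrary, and any other pair can be substituted.

```r
# EuStockMarkets is a multivariate time series with one column per stock index.
data(EuStockMarkets)
print(colnames(EuStockMarkets))

# Pull out two columns as plain numeric vectors.
x <- as.numeric(EuStockMarkets[, "DAX"])
y <- as.numeric(EuStockMarkets[, "FTSE"])

fit <- lm(y ~ x)
r <- cor(x, y)
print(r)

plot(x, y, xlab = "DAX", ylab = "FTSE", pch = ".")
abline(fit)
```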

Again, repeat the above R code, but instead use columns 2 and 4 from USJudgeRatings. Notice the relatively large correlation coefficient, but ponder whether there is really any connection between these two columns. Repeat for columns 1 and 4 and interpret the results.
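A short sketch for this comparison is given below. It selects columns by position, as the exercise describes, and prints their names so you can look up what each rating measures. The names `r_24` and `r_14` are chosen here for illustration.

```r
# USJudgeRatings is a data frame of ratings, one row per judge.
data(USJudgeRatings)

# Names of the columns used in this exercise.
print(names(USJudgeRatings)[c(1, 2, 4)])

# Correlation between columns 2 and 4, then between columns 1 and 4.
r_24 <- cor(USJudgeRatings[, 2], USJudgeRatings[, 4])
r_14 <- cor(USJudgeRatings[, 1], USJudgeRatings[, 4])
print(c(r_24 = r_24, r_14 = r_14))

# Best-fit line for columns 2 and 4.
fit <- lm(USJudgeRatings[, 4] ~ USJudgeRatings[, 2])
plot(USJudgeRatings[, 2], USJudgeRatings[, 4],
     xlab = names(USJudgeRatings)[2], ylab = names(USJudgeRatings)[4], pch = 19)
abline(fit)
```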