
Section 2.3 Correlation

You can plot the data points together with the best-fit line determined in the previous section, but the question remains whether the line is any good. In particular, the line is often used to predict y-values for given x-values. However, it is quite possible that the best-fit line does not pass through any of the provided data points. So how can something that misses every marker still be considered a good fit? To quantify this, we first need a way to measure how two variables vary with each other.

Definition 2.3.1 Covariance

Given paired (sample) data

\begin{equation*} (x_0,y_0), (x_1,y_1), \ldots , (x_n,y_n) \end{equation*}

with corresponding means \(\overline{x}\) and \(\overline{y}\text{,}\) the covariance is given by

\begin{equation*} Cov(X,Y) = \sum_{k=0}^n (x_k-\overline{x})(y_k-\overline{y})/n \end{equation*}

For population data, use the same formula with the sample means replaced by \(\mu_x\text{,}\) the mean of the x-values, and \(\mu_y\text{,}\) the mean of the y-values.

Expanding the product in the definition gives

\begin{align*} Cov(X,Y) & = \sum_{k=0}^n (x_k-\overline{x})(y_k-\overline{y})/n\\ & = \sum_{k=0}^n \left [ x_k y_k -\overline{x} y_k-\overline{y} x_k + \overline{x} \overline{y} \right ]/n\\ & = \sum_{k=0}^n x_k y_k /n - \overline{x} \sum_{k=0}^n y_k /n - \overline{y} \sum_{k=0}^n x_k /n + \overline{x} \overline{y} \end{align*}

which, since \(\sum_{k=0}^n x_k /n = \overline{x}\) and \(\sum_{k=0}^n y_k /n = \overline{y}\) by the definition of the mean, simplifies to \(Cov(X,Y) = \sum_{k=0}^n x_k y_k /n - \overline{x}\, \overline{y}\text{.}\)
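As a quick numerical check, the following minimal Python sketch computes the covariance two ways: directly from the definition, and as the mean of the products minus the product of the means, which is what the expansion above reduces to. The function names and the sample data are hypothetical, and the sketch assumes the divisor is the number of data points, so that the same divisor defines the means.

```python
def covariance(xs, ys):
    """Covariance from the definition: the average product of deviations."""
    n = len(xs)  # assumption: divide by the number of data points
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n


def covariance_shortcut(xs, ys):
    """Expanded form: mean of the products minus the product of the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum(x * y for x, y in zip(xs, ys)) / n - x_bar * y_bar


# Hypothetical data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 8.0]

print(covariance(xs, ys))           # definition
print(covariance_shortcut(xs, ys))  # expanded form, should agree
```

The two values should agree up to floating-point rounding. The expanded form needs only running sums, which can be convenient when the data arrive one point at a time.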

This definition provides a measure that is a second-order term (like variance) and therefore still carries units, namely the product of the units of the x-values and the y-values. To obtain a unitless metric, consider the following measure.

Definition 2.3.3 Correlation Coefficient

Given a collection of data points, the correlation coefficient is given by

\begin{equation*} r = \frac{Cov(X,Y)}{s_x s_y} \end{equation*}

where \(s_x\) is the standard deviation of the x-values only and \(s_y\) is the standard deviation of the y-values only. A similar statistic for population data would instead use \(\sigma_x\) and \(\sigma_y\) as the respective standard deviations of the x-values and y-values.
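The definition translates almost line for line into code. The sketch below is one possible implementation, not a reference one: the function name `correlation` and the sample data are hypothetical, and it assumes the covariance and both standard deviations use the same divisor (the number of data points), so the divisor cancels in the ratio.

```python
import math


def correlation(xs, ys):
    """Correlation coefficient r = Cov(X, Y) / (s_x * s_y)."""
    n = len(xs)  # assumption: same divisor for covariance and deviations
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / n)
    # Raises ZeroDivisionError if all x-values or all y-values are equal.
    return cov / (s_x * s_y)


# Hypothetical data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 8.0]

r = correlation(xs, ys)
print(r)  # unitless, and always between -1 and 1
```

Rescaling either variable by a positive constant, for example converting the x-values from meters to centimeters, scales the covariance and the corresponding standard deviation by the same factor and leaves `r` unchanged.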

Assume the data points are collinear with a positive slope. Then \(TSE(m_0,b_0) = 0\) for some \(m_0 > 0\) and \(b_0\text{.}\) For this line, \(f(x_k) = m_0 x_k + b_0 = y_k\) exactly for all data points. It is then easy to show that \(\overline{y} = m_0 \overline{x} + b_0\) and \(s_y = | m_0 | s_x\text{.}\) Therefore,

\begin{equation*} Cov(X,Y) = \sum_{k=0}^n (x_k-\overline{x})(m_0 x_k + b_0 - (m_0 \overline{x} + b_0))/n = m_0 s_x^2 \end{equation*}

Putting these together, and using \(| m_0 | = m_0\) since the slope is positive, gives correlation coefficient

\begin{equation*} r = \frac{m_0 s_x^2}{s_x | m_0 | s_x} = \frac{m_0}{| m_0 |} = 1. \end{equation*}

If the data points are instead collinear with a negative slope, the same argument applies with \(m_0 / | m_0 | = -1\text{,}\) giving \(r = -1\text{.}\)
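The argument above can be checked numerically. The sketch below builds two hypothetical collinear data sets, one on a line with positive slope and one on a line with negative slope, and computes the correlation coefficient for each. The helper `correlation` and the chosen lines are illustrative assumptions, with all divisors taken to be the number of data points.

```python
import math


def correlation(xs, ys):
    """Correlation coefficient r = Cov(X, Y) / (s_x * s_y)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / n)
    return cov / (s_x * s_y)


xs = [0.0, 1.0, 2.0, 3.0]

# Hypothetical lines: y = 2x + 1 (positive slope), y = -2x + 1 (negative slope).
ys_up = [2.0 * x + 1.0 for x in xs]
ys_down = [-2.0 * x + 1.0 for x in xs]

r_up = correlation(xs, ys_up)
r_down = correlation(xs, ys_down)

print(math.isclose(r_up, 1.0))     # collinear, positive slope
print(math.isclose(r_down, -1.0))  # collinear, negative slope
```

The comparison uses `math.isclose` rather than `==` because the square roots introduce floating-point rounding.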

Interpreting correlation coefficients.

Consider the data points (1,1), (1,2), (2,1), (2,2). Plot these points and consider the nature of the best-fit line. Show using software that the correlation coefficient is zero. Justify why \(TSE(m,b) = 1\) must be the minimum.
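One way to carry out the software part of this exercise is sketched below in plain Python. The helper names are hypothetical. The sketch assumes that \(TSE(m,b)\) from the previous section is the sum of squared vertical differences between the line \(y = mx + b\) and the data. It evaluates the TSE at a few candidate lines only, so it does not replace the requested justification.

```python
import math


def correlation(xs, ys):
    """Correlation coefficient r = Cov(X, Y) / (s_x * s_y)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / n)
    return cov / (s_x * s_y)


def tse(m, b, xs, ys):
    """Assumed form of TSE(m, b): sum of squared vertical errors."""
    return sum((m * x + b - y) ** 2 for x, y in zip(xs, ys))


# The four data points from the exercise.
xs = [1.0, 1.0, 2.0, 2.0]
ys = [1.0, 2.0, 1.0, 2.0]

r = correlation(xs, ys)
print(r)

# Candidate lines: the horizontal line through the mean of the y-values,
# and two tilted lines for comparison.
print(tse(0.0, 1.5, xs, ys))
print(tse(0.2, 1.2, xs, ys))
print(tse(-0.2, 1.8, xs, ys))
```

Comparing the printed TSE values suggests which candidate is best, but explaining why no other line can do better is the part left to the reader.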