Correlation

Section 2.3 Correlation

You can plot the data points together with the best-fit line determined in the previous section, but the question remains whether that line is any good. In practice, the line is often used to predict y-values for given x-values. However, the best-fit line may well not pass through any of the provided data points. So how can something that misses every marker still be considered a good fit? To quantify this, we first need a way to measure how two variables vary with each other.

Definition 2.3.1 Covariance

Given paired (sample) data

$\begin{equation*} (x_0,y_0), (x_1,y_1), ... , (x_n,y_n) \end{equation*}$

with corresponding means $\overline{x}$ and $\overline{y}\text{,}$ the covariance is given by

$\begin{equation*} Cov(X,Y) = \sum_{k=0}^n (x_k-\overline{x})(y_k-\overline{y})/n \end{equation*}$

and similarly for population data, where the sample means are replaced by the mean of the x-values $\mu_x$ and the mean of the y-values $\mu_y\text{.}$
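The definition above can be sketched in a few lines of Python. This is only an illustration: the function names `mean` and `covariance` and the sample data are made up for this example, and the code assumes the divisor $n$ is the number of data points.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def covariance(xs, ys):
    """Average product of deviations from the means.

    Assumption: the divisor is the number of data points.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    x_bar, y_bar = mean(xs), mean(ys)
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return total / len(xs)


# Hypothetical data, chosen only to keep the arithmetic simple.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 9.0]
cov_xy = covariance(xs, ys)
print(cov_xy)  # 2.75
```

A positive value indicates that the y-values tend to lie above their mean when the x-values lie above theirs.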

Theorem 2.3.2 Alternate Formula for Covariance

$\begin{equation*} Cov(X,Y) = \frac{\sum_{k=0}^n x_k y_k}{n} - \overline{x} \overline{y} \end{equation*}$

Proof

\begin{align*} Cov(X,Y) & = \sum_{k=0}^n (x_k-\overline{x})(y_k-\overline{y})/n\\ & = \sum_{k=0}^n \left [ x_k y_k -\overline{x} y_k-\overline{y} x_k + \overline{x} \overline{y} \right ]/n\\ & = \sum_{k=0}^n x_k y_k /n - \overline{x} \sum_{k=0}^n y_k /n - \overline{y} \sum_{k=0}^n x_k /n + \overline{x} \overline{y} \end{align*}

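which simplifies to the desired result using the definition of the mean: each of the two middle terms equals $-\overline{x}\,\overline{y}\text{,}$ and one of them cancels against the final $+\overline{x}\,\overline{y}\text{.}$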
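As a numerical sanity check, the two formulas can be compared on a small data set. The sketch below is illustrative only: the function names and the data are hypothetical, and it assumes $n$ is the number of points and the means are computed with that same $n$.

```python
def covariance_definition(xs, ys):
    """Covariance as the average product of deviations from the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n


def covariance_alternate(xs, ys):
    """Covariance as (mean of the products) minus (product of the means)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum(x * y for x, y in zip(xs, ys)) / n - x_bar * y_bar


# Hypothetical data used only for the comparison.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 9.0]
by_definition = covariance_definition(xs, ys)
by_alternate = covariance_alternate(xs, ys)
print(by_definition, by_alternate)
```

With floating-point data the two results may differ in the last few digits, so compare them with a tolerance rather than with exact equality.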

Covariance is a second-order quantity (like variance), and it carries units: the product of the units of the x-values and the units of the y-values. To obtain a unit-less metric, consider the following measure.

Definition 2.3.3 Correlation Coefficient

Given a collection of data points, the correlation coefficient is given by

$\begin{equation*} r = \frac{Cov(X,Y)}{s_x s_y} \end{equation*}$

where $s_x$ is the standard deviation of the x-values only and $s_y$ is the standard deviation of the y-values only. A similar statistic for population data would instead use $\sigma_x$ and $\sigma_y$ as the respective standard deviations of the x-values and y-values.
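A minimal Python sketch of this definition follows. The name `correlation` and the data are hypothetical. The code assumes that the covariance and both standard deviations use the same divisor (here the number of points), so that the divisor cancels in the ratio.

```python
import math


def correlation(xs, ys):
    """Correlation coefficient r = Cov(X, Y) / (s_x * s_y).

    Assumption: covariance and standard deviations share the same divisor.
    """
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / n)
    if s_x == 0 or s_y == 0:
        raise ValueError("r is undefined when either variable is constant")
    return cov / (s_x * s_y)


# Hypothetical data; the points rise together but are not collinear.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 9.0]
r = correlation(xs, ys)
print(round(r, 4))
```

Because the units of the covariance are divided out by the units of $s_x s_y\text{,}$ rescaling either variable by a positive constant leaves the result unchanged.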

Theorem 2.3.4 Interpretation of the Correlation Coefficient

If the points are collinear with a positive slope then $r=1\text{,}$ and if the points are collinear with a negative slope then $r=-1\text{.}$
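The statement can be checked numerically on exactly collinear data. The sketch below is only an illustration: the helper `correlation`, the slopes, and the intercepts are all made up for this example, and the helper assumes a common divisor for the covariance and the standard deviations.

```python
import math


def correlation(xs, ys):
    """Correlation coefficient with a common divisor n throughout."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / n)
    return cov / (s_x * s_y)


xs = [0.0, 1.0, 2.0, 5.0]
rising = [3.0 * x + 1.0 for x in xs]    # hypothetical line with slope 3
falling = [-2.0 * x + 4.0 for x in xs]  # hypothetical line with slope -2

r_up = correlation(xs, rising)
r_down = correlation(xs, falling)
print(r_up, r_down)
```

Up to floating-point rounding, `r_up` should come out as 1 and `r_down` as -1, regardless of how steep each line is.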

Proof

Assume the data points are collinear with a positive slope. Then $TSE(m_0,b_0) = 0$ for some $m_0$ and $b_0\text{.}$ For this line, $f(x) = m_0 x + b_0\text{,}$ notice that $f(x_k) = y_k$ exactly for all data points. It is easy to show then that $\overline{y} = m_0 \overline{x} + b_0$ and $s_y = | m_0 | s_x\text{.}$ Therefore,

\begin{equation*} Cov(X,Y) = \sum_{k=0}^n (x_k-\overline{x})(m_0 x_k + b_0 - (m_0 \overline{x} + b_0))/n = m_0 \sum_{k=0}^n (x_k-\overline{x})^2/n = m_0 s_x^2. \end{equation*}

Putting these together gives the correlation coefficient

\begin{equation*} r = \frac{m_0 s_x^2}{s_x | m_0 | s_x} = \frac{m_0}{| m_0 |} = 1. \end{equation*}

A similar proof follows in the second case by noting that $m_0 / | m_0 | = -1$ when $m_0 < 0\text{.}$
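The two facts that the proof calls easy to show can be sketched as follows. This assumes that $n$ is the number of data points, that the sums run over all of them, and that the standard deviations use the same divide-by-$n$ convention as the covariance. Substituting $y_k = m_0 x_k + b_0$ gives

\begin{align*} \overline{y} & = \frac{1}{n}\sum_k (m_0 x_k + b_0) = m_0 \overline{x} + b_0\\ s_y^2 & = \frac{1}{n}\sum_k (y_k - \overline{y})^2 = \frac{1}{n}\sum_k m_0^2 (x_k - \overline{x})^2 = m_0^2 s_x^2 \end{align*}

and taking square roots yields $s_y = | m_0 | s_x\text{,}$ since a standard deviation is never negative.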

Essentials of Mathematical Probability and Statistics: A First Course For the Mathematically Inclined
