Statistical Measures of Variation

Section 1.5 Statistical Measures of Variation

These measures provide some indication of how much the data set is "spread out". Indeed, note that the data sets {-2,-1,0,1,2} and {-200,-100,0,100,200} have the same mean but one is much more spread out than the other. Measures of variation should catch this difference.

Definition 1.5.1 Range:

Using the order statistics $y_1 \le y_2 \le \cdots \le y_n$, the range is

$\begin{equation*} y_n - y_1. \end{equation*}$

The range is very easy to compute, but it completely ignores all data values except the two extremes.

From the Presidential data 1.3.2, the maximum is 69 and the minimum is 46 so the range is 23, the difference of these two.
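As a quick check of the range definition, the following R sketch compares the two data sets from the start of this section. The variable names `a` and `b` are chosen here only for illustration.

```r
# Two data sets with the same mean but very different spread
a <- c(-2, -1, 0, 1, 2)
b <- c(-200, -100, 0, 100, 200)

mean_a <- mean(a)
mean_b <- mean(b)

# range() returns c(min, max), so diff() gives max - min
range_a <- diff(range(a))
range_b <- diff(range(b))

print(c(mean_a = mean_a, mean_b = mean_b))      # both means are 0
print(c(range_a = range_a, range_b = range_b))  # 4 and 400
```

The means agree, but the range separates the two sets by a factor of 100.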

Definition 1.5.2 Interquartile Range (IQR):

The difference between the 75th and 25th percentiles (the third and first quartiles), $P^{0.75} - P^{0.25}\text{.}$

For the data set {2, 5, 8, 10}, you have found that $Q_1 = 2.75$ and $Q_3 = 9.5\text{.}$ Therefore,

$\begin{equation*} IQR = 9.5 - 2.75 = 6.75. \end{equation*}$
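The quartiles above can be reproduced with a short R sketch. The helper `percentile_np1` is a hypothetical name, not a built-in function. It assumes the percentile rule used in this text: locate position $(n+1)p$ among the order statistics and interpolate linearly between the two neighboring values. Clamping to the smallest or largest value when the position falls outside $1, \ldots, n$ is an added assumption.

```r
# Hypothetical helper: percentile via the (n + 1) * p position rule
percentile_np1 <- function(x, p) {
  y <- sort(x)               # order statistics y_1 <= ... <= y_n
  n <- length(y)
  pos <- (n + 1) * p         # fractional position in the ordered data
  lo <- floor(pos)           # index of the lower neighbor
  frac <- pos - lo           # weight given to the upper neighbor
  if (lo < 1) return(y[1])   # assumption: clamp at the minimum
  if (lo >= n) return(y[n])  # assumption: clamp at the maximum
  (1 - frac) * y[lo] + frac * y[lo + 1]
}

x <- c(2, 5, 8, 10)
q1 <- percentile_np1(x, 0.25)   # 2.75
q3 <- percentile_np1(x, 0.75)   # 9.5
iqr <- q3 - q1                  # 6.75
print(c(Q1 = q1, Q3 = q3, IQR = iqr))

# R's built-in IQR() uses a different default rule (type 7);
# type = 6 corresponds to the (n + 1) * p rule used here.
print(IQR(x, type = 6))
```

Calling `IQR(x)` without `type = 6` returns a different value for this data set, so it is worth checking which rule a given tool uses.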

Average Deviation from the Mean (Population): Given a population data set $x_1, x_2, ... , x_n$ with mean $\mu$ each term deviates from the mean by the value $x_k - \mu\text{.}$ So, averaging these gives

$\begin{equation*} \frac{\sum_{k=1}^n (x_k-\mu)}{n} = \frac{\sum_{k=1}^n x_k}{n} - \frac{\sum_{k=1}^n \mu}{n} = \mu - \mu = 0. \end{equation*}$

This average is therefore zero for every data set: the positive and negative deviations cancel exactly, so it tells us nothing about spread. We need to find ways to avoid this cancellation.
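The cancellation can be seen numerically. This is a minimal R sketch using the data set {2, 5, 8, 10} from this section. With other data, floating-point rounding may leave a tiny nonzero value rather than an exact zero.

```r
x <- c(2, 5, 8, 10)
mu <- mean(x)                 # 6.25

deviations <- x - mu          # signed deviations from the mean
avg_dev <- sum(deviations) / length(x)

print(deviations)             # -4.25 -1.25  1.75  3.75
print(avg_dev)                # 0
```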

Average Absolute Deviation from the Mean (Population):

$\begin{equation*} \frac{\sum_{k=1}^n \left | x_k-\mu \right |}{n} \end{equation*}$

which, although simple to state, is difficult to work with algebraically since absolute values do not simplify well. To avoid this roadblock, we can accomplish nearly the same goal by squaring the deviations and later taking a square root.

Average Squared Deviation from the Mean (Population):

$\begin{equation*} \frac{\sum_{k=1}^n ( x_k-\mu )^2}{n} \end{equation*}$

which is always non-negative and can be easily expanded using algebra. Since "average squared deviation from the mean" is a mouthful, this measure is generally called the "variance".

In the average squared deviation from the mean, each difference has been squared. All of the terms added are therefore non-negative, but differences smaller than 1 in magnitude have been made even smaller and larger ones have been made relatively larger. The result is also measured in squared units. To undo this scaling, one takes a square root to get back to the units of the original data.
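To compare these measures side by side, here is a small R sketch on the data set {2, 5, 8, 10}. The variable names are illustrative only.

```r
x <- c(2, 5, 8, 10)
mu <- mean(x)

avg_abs_dev <- mean(abs(x - mu))   # average absolute deviation: 2.75
avg_sq_dev  <- mean((x - mu)^2)    # average squared deviation: 9.1875
root_avg_sq <- sqrt(avg_sq_dev)    # back in the original units, about 3.03

print(c(avg_abs_dev, avg_sq_dev, root_avg_sq))
```

The square root of the average squared deviation lands close to the average absolute deviation, which is the sense in which squaring and then taking a root "nearly" accomplishes the same goal.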

Definition 1.5.3 Variance and Standard Deviation

The variance is the average squared deviation from the mean. If this data comes from the entire universe of possibilities then we call it a population variance and denote this value by $\sigma^2\text{.}$ Therefore

$\begin{equation*} \sigma^2 = \frac{\sum_{k=1}^n ( x_k-\mu )^2}{n} \end{equation*}$

The standard deviation is the square root of the variance. If this data comes from the entire universe of possibilities then we call it a population standard deviation and denote this value by $\sigma\text{.}$ Therefore

$\begin{equation*} \sigma = \sqrt{\frac{\sum_{k=1}^n ( x_k-\mu )^2}{n}}. \end{equation*}$

If the data comes from a sample of the population, then we denote the average squared deviation from the sample mean $\overline{x}$ by $v$. Since $v$ tends to underestimate the population variance (a "bias"), we increase this value slightly by the factor $\frac{n}{n-1}$ to give the sample variance

$\begin{equation*} s^2 = \frac{n}{n-1}\frac{\sum_{k=1}^n ( x_k-\overline{x} )^2}{n} = \frac{\sum_{k=1}^n ( x_k-\overline{x} )^2}{n-1} \end{equation*}$

and the sample standard deviation $s$ is similarly defined as the square root of the sample variance.

From the data {2,5,8,10}, you have found that the mean is 6.25. Treating these values as a population, the variance is the average of the squared differences between each data value and this mean:

$\begin{align*} & \frac{1}{4} \left ( (2-6.25)^2 + (5-6.25)^2 + (8-6.25)^2 + (10-6.25)^2 \right ) \\ & = \frac{18.0625 + 1.5625 + 3.0625 + 14.0625}{4} \\ & = \frac{36.75}{4}\\ & = 9.1875. \end{align*}$
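The same computation can be checked in R. Note that R's built-in `var()` and `sd()` divide by $n-1$, so they return the sample versions. The population variance is computed directly from the definition in this sketch.

```r
x <- c(2, 5, 8, 10)
n <- length(x)

# Population variance and standard deviation (divide by n)
pop_var <- sum((x - mean(x))^2) / n   # 9.1875
pop_sd  <- sqrt(pop_var)

# Sample variance and standard deviation (divide by n - 1)
samp_var <- var(x)                    # 36.75 / 3 = 12.25
samp_sd  <- sd(x)                     # 3.5

print(c(pop_var = pop_var, pop_sd = pop_sd))
print(c(samp_var = samp_var, samp_sd = samp_sd))

# The two variances differ by the factor n / (n - 1)
print(isTRUE(all.equal(samp_var, n / (n - 1) * pop_var)))
```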

Theorem 1.5.4 Alternate Forms for Variance

$\begin{align*} \sigma^2 & = \left ( \frac{\sum_{k=1}^n x_k^2 }{n} \right ) - \mu^2 \\ & = \left [ \frac{\sum_{k=1}^n x_k(x_k - 1)}{n} \right ] + \mu - \mu^2 \end{align*}$

Proof

\begin{align*} \sigma^2 & = \frac{\sum_{k=1}^n ( x_k-\mu )^2}{n}\\ & = \frac{\sum_{k=1}^n ( x_k^2 - 2x_k \mu + \mu^2 )}{n}\\ & = \frac{\sum_{k=1}^n x_k^2 - 2\mu \sum_{k=1}^n x_k + n \mu^2}{n}\\ & = \left ( \frac{\sum_{k=1}^n x_k^2 }{n} \right ) - 2\mu^2 + \mu^2\\ & = \left ( \frac{\sum_{k=1}^n x_k^2 }{n} \right ) - \mu^2 \end{align*}

The second part is proved similarly. Using the first part of the proof above,

\begin{align*} \sigma^2 & = \frac{\sum_{k=1}^n ( x_k-\mu )^2}{n}\\ & = \left ( \frac{\sum_{k=1}^n x_k^2 }{n} \right ) - \mu^2\\ & = \left ( \frac{\sum_{k=1}^n \left [ x_k (x_k - 1) + x_k \right ] }{n} \right ) - \mu^2\\ & = \left ( \frac{\sum_{k=1}^n x_k (x_k - 1)}{n} \right ) + \mu - \mu^2 \end{align*}
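A numerical check of Theorem 1.5.4 can be sketched in R. Using the data set {2, 5, 8, 10}, all three expressions should agree with the population variance computed from the definition. The variable names are illustrative.

```r
x <- c(2, 5, 8, 10)
n <- length(x)
mu <- mean(x)

by_definition <- sum((x - mu)^2) / n              # average squared deviation
form_one <- sum(x^2) / n - mu^2                   # first alternate form
form_two <- sum(x * (x - 1)) / n + mu - mu^2      # second alternate form

print(c(by_definition, form_one, form_two))       # each equals 9.1875
```

The first alternate form needs only the running totals of $x_k$ and $x_k^2$, which is why it is convenient for hand computation.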

Example 1.5.5 Computing means and variances by hand

In the data table below, the $x_k$ column contains the given data values and the $x_k^2$ column is easily computed from it.

$x_k$	$x_k^2$
1	1
-1	1
0	0
2	4
2	4
5	25

Table 1.5.6 Sample Grouped Data

So, $\Sigma x_k = 9$ and $\Sigma x_k^2 = 35\text{.}$ Therefore $\overline{x} = \frac{9}{6} = \frac{3}{2}$ and $v = \frac{\Sigma x_k^2}{6} - (\overline{x})^2 = \frac{35}{6} - \left ( \frac{3}{2} \right )^2 = \frac{70-27}{12} = \frac{43}{12}\text{.}$ Therefore, $s^2 = \frac{6}{5} \times v = \frac{43}{10} = 4.3\text{.}$

Use R to compute these values...

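data <- c(1, -1, 0, 2, 2, 5)   # combine the data values into a numeric vector
paste("Variance = ", var(data))   # sample variance (divides by n - 1)
paste("Standard Dev = ", sd(data))   # sample standard deviation
paste("Interquartile Range =", IQR(data))   # uses R's default quantile rule, which may differ from the (n+1)p method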
paste("Box and Whisker Diagram:")
boxplot(data, horizontal=TRUE)

Once again, the populations of the individual US states according to the 2013 Census are considered below.

Checkpoint 1.5.7 USA State Population Measures of Variation

Using the US Census Bureau state populations 1.4.5 (in millions) for 2014 provided earlier, determine the range, quartiles, and variance for this sample data.

Solution

Again, you should note that these are already in order so the range is quickly found to be

\begin{equation*} y_n - y_1 = 38.3 - 0.6 = 37.7 \end{equation*}

million residents.

For the IQR, we first must determine the quartiles. The median (found earlier) is already the second quartile, so we have $Q_2 = 4.5$ million. For the other two, the formula for computing percentiles gives you the 25th percentile

\begin{gather*} (n+1)p = 51(1/4) = 12.75\\ Q_1 = P^{0.25} = 0.25 \times 1.9 + 0.75 \times 2.1 = 2.05 \end{gather*}

and the 75th percentile

\begin{gather*} (n+1)p = 51(3/4) = 38.25\\ Q_3 = P^{0.75} = 0.75 \times 7 + 0.25 \times 8.3 = 7.325. \end{gather*}

Hence, the IQR = 7.325 - 2.05 = 5.275 million residents.

From the computation before, again note that n=51 since the District of Columbia is included. The mean of this data was found earlier to be approximately 6.20 million residents. So, to determine the variance you may find it easier to use the alternate variance formula from Theorem 1.5.4.

\begin{align*} v & = \left ( \frac{\sum_{k=1}^n y_k^2 }{n} \right ) - \mu^2\\ & \approx \frac{4434.37}{51} - (6.20)^2\\ & \approx 48.51 \end{align*}

and so you get a sample variance of

\begin{equation*} s^2 \approx \frac{51}{50} \cdot 48.51 = 49.48 \end{equation*}

and a sample standard deviation of

\begin{equation*} s \approx \sqrt{49.48} \approx 7.03 \end{equation*}

million residents.

The state population data set has been entered for you in the R cell below...

data <- c (0.6,0.6,0.6,0.7,0.7,0.8,0.9,1,1.1,1.3,1.3,1.4,1.6,
1.9,1.9,2.1,2.8,2.9,2.9,3,3,3.1,3.6,3.9,3.9,4.4,4.6,
4.8,4.8,5.3,5.4,5.7,5.9,6,6.5,6.6,6.6,6.7,7,8.3,
8.9,9.8,9.9,10,11.6,12.8,12.9,19.6,19.7,26.4,38.3)
paste("Variance = ", var(data))
paste("Standard Dev = ", sd(data))
paste("Interquartile Range =", IQR(data))
